Visualization of publicly available papers presented at ICCV 2013
Hover over a node to see the paper title. Click on a color to only show papers connected to that cluster. Zoom and move around with normal map controls.
Papers are linked together based on TF-IDF similarity and are colored using their predicted topic index.
Toggle the topics below to sort by category. The top 10 words from each cluster are shown.
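The TF-IDF similarity linking described above can be made concrete with a few lines of plain Python. The sketch below is illustrative only (the site's actual pipeline is not shown here): it builds TF-IDF vectors over tokenized abstracts and scores pairs by cosine similarity; `tfidf_vectors` and `cosine` are names chosen for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for a list of tokenized docs."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Papers sharing distinctive vocabulary score high; papers with disjoint vocabularies score zero, which is what drives the link structure of the graph.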
Compensating for Motion during Direct-Global Separation [pdf]
Supreeth Achar, Stephen T. Nuske, Srinivasa G. Narasimhan
Abstract: Separating the direct and global components of radiance can aid shape recovery algorithms and can provide useful information about materials in a scene. Practical methods for finding the direct and global components use multiple images captured under varying illumination patterns and require the scene, light source and camera to remain stationary during the image acquisition process. In this paper, we develop a motion compensation method that relaxes this condition and allows direct-global separation to be performed on video sequences of dynamic scenes captured by moving projector-camera systems. Key to our method is being able to register frames in a video sequence to each other in the presence of time varying, high frequency active illumination patterns. We compare our motion compensated method to alternatives such as single shot separation and frame interleaving as well as ground truth. We present results on challenging video sequences that include various types of motions and deformations in scenes that contain complex materials like fabric, skin, leaves and wax.
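The separation itself (for static scenes) follows the classic high-frequency illumination result that this work builds on: with a pattern lighting half the scene, the per-pixel maximum over shifted patterns contains the direct component plus half the global, while the minimum contains half the global alone. A minimal sketch, assuming ideal binary patterns and per-pixel `lmax`/`lmin` images already computed:

```python
def separate_direct_global(lmax, lmin):
    """Per-pixel direct/global split from max and min images captured under
    shifted high-frequency patterns that light half of the scene:
      lmax = direct + global/2,   lmin = global/2
    (ideal binary patterns assumed)."""
    direct = [mx - mn for mx, mn in zip(lmax, lmin)]
    glob = [2 * mn for mn in lmin]
    return direct, glob
```

The paper's contribution is making the `lmax`/`lmin` computation valid when the scene moves between pattern shifts, via frame registration under the active illumination.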
Abstract: We consider the problem of estimating the extrinsic parameters (pose) of a camera with respect to a reference 3D object without a direct view. Since the camera does not view the object directly, previous approaches have utilized reflections in a planar mirror to solve this problem. However, a planar mirror based approach requires a minimum of three reflections and has degenerate configurations where estimation fails. In this paper, we show that the pose can be obtained using a single reflection in a spherical mirror of known radius. This makes our approach simpler and easier in practice. In addition, unlike planar mirrors, the spherical mirror based approach does not have any degenerate configurations, leading to a robust algorithm.
While a planar mirror reflection results in a virtual perspective camera, a spherical mirror reflection results in a non-perspective axial camera. The axial nature of rays allows us to compute the axis (direction of sphere center) and a few pose parameters in a linear fashion. We then derive an analytical solution to obtain the distance to the sphere center and the remaining pose parameters, and show that it corresponds to solving a 16th degree equation. We present comparisons with a recent method that uses planar mirrors and show that our approach recovers a more accurate pose in the presence of noise. Extensive simulations and results on real data validate our algorithm.
Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search [pdf]
Dror Aiger, Efi Kokiopoulou, Ehud Rivlin
Abstract: We propose two solutions for both nearest neighbors and range search problems. For the nearest neighbors problem, we propose a c-approximate solution for the restricted version of the decision problem with bounded radius, which is then reduced to the nearest neighbors problem by a known reduction. For range searching we propose a scheme that learns the parameters in a learning stage, adapting them to the case of a set of points with low intrinsic dimension that are embedded in a high dimensional space (a common scenario for image point descriptors). We compare our algorithms to the best known methods for these problems, i.e. LSH, ANN and FLANN. We show analytically and experimentally that we can do better for a moderate approximation factor. Our algorithms are trivial to parallelize. In the experiments conducted, running on a couple of million images, our algorithms show meaningful speed-ups when compared with the above mentioned methods.
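The random-grids idea can be sketched in a few lines: hash all points into several coarse grids with independent random shifts, then at query time collect the points sharing a cell with the query in any grid and return the closest candidate. This is a toy illustration of the scheme, not the authors' implementation; `cell` controls the grid resolution and all names are chosen for this example.

```python
import math
import random

def build_grids(points, cell, n_grids=4, seed=0):
    """Index points into n_grids randomly shifted uniform grids."""
    rng = random.Random(seed)
    dim = len(points[0])
    grids = []
    for _ in range(n_grids):
        shift = [rng.uniform(0, cell) for _ in range(dim)]
        table = {}
        for i, p in enumerate(points):
            key = tuple(int((p[d] + shift[d]) // cell) for d in range(dim))
            table.setdefault(key, []).append(i)
        grids.append((shift, table))
    return grids

def query(points, grids, q, cell):
    """Approximate nearest neighbor: best point among cell-mates of q."""
    dim = len(q)
    cand = set()
    for shift, table in grids:
        key = tuple(int((q[d] + shift[d]) // cell) for d in range(dim))
        cand.update(table.get(key, []))
    if not cand:
        return None
    return min(cand, key=lambda i: math.dist(q, points[i]))
```

Using several shifted grids makes it unlikely that a true near neighbor is separated from the query by a cell boundary in every grid, which is the intuition behind the approximation guarantee.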
Pose Estimation and Segmentation of People in 3D Movies [pdf]
Karteek Alahari, Guillaume Seguin, Josef Sivic, Ivan Laptev
Abstract: We seek to obtain a pixel-wise segmentation and pose estimation of multiple people in a stereoscopic video. This involves challenges such as dealing with unconstrained stereoscopic video, non-stationary cameras, and complex indoor and outdoor dynamic scenes. The contributions of our work are two-fold: First, we develop a segmentation model incorporating person detection, pose estimation, as well as colour, motion, and disparity cues. Our new model explicitly represents depth ordering and occlusion. Second, we introduce a stereoscopic dataset with frames extracted from the feature-length movies StreetDance 3D and Pina. The dataset contains 2727 realistic stereo pairs and includes annotation of human poses, person bounding boxes, and pixel-wise segmentations for hundreds of people. The dataset is composed of indoor and outdoor scenes depicting multiple people with frequent occlusions. We demonstrate results on our new challenging dataset, as well as on the H2view dataset from (Sheasby et al. ACCV 2012).
Abstract: In this paper a notion of flow complexity that measures the amount of interaction among objects is introduced, and an approach to compute it directly from a video sequence is proposed. The approach employs particle trajectories as the input representation of motion and maps it into a braid based representation. The mapping is based on the observation that 2D trajectories of particles take the form of a braid in space-time due to the intermingling among particles over time. As a result of this mapping, the problem of estimating the flow complexity from particle trajectories becomes the problem of estimating braid complexity, which in turn can be computed by measuring the topological entropy of a braid. For this purpose recently developed mathematical tools from braid theory are employed which allow rapid computation of the topological entropy of braids. The approach is evaluated on a dataset consisting of open source videos depicting variations in terms of types of moving objects, scene layout, camera view angle, motion patterns, and object densities. The results show that the proposed approach is able to quantify the complexity of the flow, and at the same time provides useful insights about the sources of the complexity.
Handwritten Word Spotting with Corrected Attributes [pdf]
Jon Almazan, Albert Gordo, Alicia Fornes, Ernest Valveny
Abstract: We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to a unified representation of word images and strings, which seamlessly allows one to indistinctly perform query-by-example, where the query is an image, and query-by-string, where the query is a string. We also propose a calibration scheme, based on Canonical Correlation Analysis, to correct the attribute scores, which greatly improves the results on a challenging dataset. We test our approach on two public datasets showing state-of-the-art results.
Calibration-Free Gaze Estimation Using Human Gaze Patterns [pdf]
Fares Alnajar, Theo Gevers, Roberto Valenti, Sennay Ghebreab
Abstract: We present a novel method to auto-calibrate gaze estimators based on gaze patterns obtained from other viewers. Our method is based on the observation that the gaze patterns of humans are indicative of where a new viewer will look [12]. When a new viewer is looking at a stimulus, we first estimate a topology of gaze points (initial gaze points). Next, these points are transformed so that they match the gaze patterns of other humans to find the correct gaze points.
In a flexible uncalibrated setup with a web camera and no chin rest, the proposed method was tested on ten subjects and ten images. The method estimates the gaze points after looking at a stimulus for a few seconds with an average accuracy of 4.3. Although the reported performance is lower than what could be achieved with dedicated hardware or a calibrated setup, the proposed method still provides sufficient accuracy to trace the viewer's attention. This is promising considering the fact that auto-calibration is done in a flexible setup, without the use of a chin rest, and based only on a few seconds of gaze initialization data. To the best of our knowledge, this is the first work to use human gaze patterns in order to auto-calibrate gaze estimators.
Monte Carlo Tree Search for Scheduling Activity Recognition [pdf]
Mohamed R. Amer, Sinisa Todorovic, Alan Fern, Song-Chun Zhu
Abstract: This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address challenges of the domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, as well as enables multitarget tracking. The standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors, and tracking their detections in long video footage. This problem is addressed by formulating a cost-sensitive inference of ST-AOG as Monte Carlo Tree Search (MCTS). For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. Evaluation on the benchmark datasets demonstrates that MCTS enables two-orders-of-magnitude speed-ups without compromising accuracy relative to the standard cost-insensitive inference.
Abstract: The task of object pose estimation has been a challenge since the early days of computer vision. To estimate the pose (or viewpoint) of an object, people have mostly looked at object intrinsic features, such as shape or appearance. Surprisingly, informative features provided by other, external elements in the scene have so far mostly been ignored. At the same time, contextual cues have been shown to be of great benefit for related tasks such as object detection or action recognition. In this paper, we explore how information from other objects in the scene can be exploited for pose estimation. In particular, we look at object configurations. We show that, starting from noisy object detections and pose estimates, exploiting the estimated pose and location of other objects in the scene can help to estimate the objects' poses more accurately. We explore both a camera-centered as well as an object-centered representation for relations. Experiments on the challenging KITTI dataset show that object configurations can indeed be used as a complementary cue to appearance-based pose estimation. In addition, object-centered relational representations can also assist object detection.
Revisiting Example Dependent Cost-Sensitive Learning with Decision Trees [pdf]
Oisin Mac Aodha, Gabriel J. Brostow
Abstract: Typical approaches to classification treat class labels as disjoint. For each training example, it is assumed that there is only one class label that correctly describes it, and that all other labels are equally bad. We know, however, that good and bad labels are too simplistic in many scenarios, hurting accuracy. In the realm of example dependent cost-sensitive learning, each label is instead a vector representing a data point's affinity for each of the classes. At test time, our goal is not to minimize the misclassification rate, but to maximize that affinity. We propose a novel example dependent cost-sensitive impurity measure for decision trees. Our experiments show that this new impurity measure improves test performance while still retaining the fast test times of standard classification trees. We compare our approach to classification trees and other cost-sensitive methods on three computer vision problems: tracking, descriptor matching, and optical flow, and show improvements in all three domains.
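As an illustration of the idea (not the paper's exact impurity measure), one can score a node of affinity-labeled examples by how much average affinity is lost when the node commits to a single class, and score candidate splits by the resulting impurity reduction. All names here are hypothetical:

```python
def affinity_impurity(affinities):
    """Impurity of a node holding examples with per-class affinity vectors:
    1 minus the best average affinity achievable by predicting one class."""
    n = len(affinities)
    k = len(affinities[0])
    best = max(sum(a[c] for a in affinities) / n for c in range(k))
    return 1.0 - best

def split_gain(parent, left, right):
    """Impurity reduction of splitting `parent` into `left` and `right`."""
    n = len(parent)
    child = (len(left) * affinity_impurity(left) +
             len(right) * affinity_impurity(right)) / n
    return affinity_impurity(parent) - child
```

With one-hot affinity vectors this reduces to an ordinary misclassification-style impurity; graded affinities make some mistakes cost less than others, which is the point of the example-dependent formulation.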
Abstract: This paper addresses the data assignment problem in multi frame multi object tracking in video sequences. Traditional methods employing maximum weight bipartite matching offer limited temporal modeling. It has recently been shown [6, 8, 24] that incorporating higher order temporal constraints improves the assignment solution. Finding maximum weight matching with higher order constraints is however NP-hard, and the solutions proposed until now have either been greedy [8] or rely on greedy rounding of the solution obtained from spectral techniques [15]. We propose a novel algorithm to find an approximate solution to the data assignment problem with higher order temporal constraints using the method of dual decomposition and the MPLP message passing algorithm [21]. We compare the proposed algorithm with an implementation of [8] and [15] and show that the proposed technique provides a better solution with a bound on the approximation factor for each inferred solution.
Quantize and Conquer: A Dimensionality-Recursive Solution to Clustering, Vector Quantization, and Image Retrieval [pdf]
Yannis Avrithis
Abstract: Inspired by the close relation between nearest neighbor search and clustering in high-dimensional spaces, as well as the success of one helping to solve the other, we introduce a new paradigm where both problems are solved simultaneously. Our solution is recursive, not in the size of input data but in the number of dimensions. One result is a clustering algorithm that is tuned to small codebooks but does not need all data in memory at the same time and is practically constant in the data size. As a by-product, a tree structure performs either exact or approximate quantization on trained centroids, the latter being not very precise but extremely fast. A lesser contribution is a new indexing scheme for image retrieval that exploits multiple small codebooks to provide an arbitrarily fine partition of the descriptor space. Large scale experiments on public datasets exhibit state of the art performance and remarkable generalization.
Finding Causal Interactions in Video Sequences [pdf]
Mustafa Ayazoglu, Burak Yilmaz, Mario Sznaier, Octavia Camps
Abstract: This paper considers the problem of detecting causal interactions in video clips. Specifically, the goal is to detect whether the actions of a given target can be explained in terms of the past actions of a collection of other agents. We propose to solve this problem by recasting it into a directed graph topology identification, where each node corresponds to the observed motion of a given target, and each link indicates the presence of a causal correlation. As shown in the paper, this leads to a block-sparsification problem that can be efficiently solved using a modified Group-Lasso type approach, capable of handling missing data and outliers (due for instance to occlusion and mis-identified correspondences). Moreover, this approach also identifies time instants where the interactions between agents change, thus providing event detection capabilities. These results are illustrated with several examples involving nontrivial interactions amongst several human subjects.
Randomized Ensemble Tracking [pdf]
Qinxun Bai, Zheng Wu, Stan Sclaroff, Margrit Betke, Camille Monnier
Abstract: We propose a randomized ensemble algorithm to model the time-varying appearance of an object for visual tracking. In contrast with previous online methods for updating classifier ensembles in tracking-by-detection, the weight vector that combines weak classifiers is treated as a random variable and the posterior distribution for the weight vector is estimated in a Bayesian manner. In essence, the weight vector is treated as a distribution that reflects the confidence among the weak classifiers used to construct and adapt the classifier ensemble. The resulting formulation models the time-varying discriminative ability among weak classifiers so that the ensembled strong classifier can adapt to the varying appearance, backgrounds, and occlusions. The formulation is tested in a tracking-by-detection implementation. Experiments on 28 challenging benchmark videos demonstrate that the proposed method can achieve results comparable to and often better than those of state-of-the-art approaches.
Unsupervised Domain Adaptation by Domain Invariant Projection [pdf]
Mahsa Baktashmotlagh, Mehrtash T. Harandi, Brian C. Lovell, Mathieu Salzmann
Abstract: Domain-invariant representations are key to addressing the domain shift problem, where the training and test examples follow different distributions. Existing techniques that have attempted to match the distributions of the source and target domains typically compare these distributions in the original feature space. This space, however, may not be directly suitable for such a comparison, since some of the features may have been distorted by the domain shift, or may be domain specific. In this paper, we introduce a Domain Invariant Projection approach: an unsupervised domain adaptation method that overcomes this issue by extracting the information that is invariant across the source and target domains. More specifically, we learn a projection of the data to a low-dimensional latent space where the distance between the empirical distributions of the source and target examples is minimized. We demonstrate the effectiveness of our approach on the task of visual object recognition and show that it outperforms state-of-the-art methods on a standard domain adaptation benchmark dataset.
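A common choice for the distance between empirical source and target distributions in this line of work is the Maximum Mean Discrepancy (MMD). The sketch below computes the biased empirical MMD estimate with an RBF kernel on plain tuples, just to make the criterion concrete; the projection learning itself is omitted and all names are chosen for this example.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF (Gaussian) kernel between two points given as tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased empirical estimate of the squared MMD between two samples."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy
```

Identical samples give zero; well-separated samples give a large value, so minimizing this quantity over a projection pulls the two domains' distributions together in the latent space.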
Space-Time Robust Representation for Action Recognition [pdf]
Nicolas Ballas, Yi Yang, Zhen-Zhong Lan, Bertrand Delezoide, Francoise Preteux, Alexander Hauptmann
Abstract: We address the problem of action recognition in unconstrained videos. We propose a novel content driven pooling that leverages space-time context while being robust toward global space-time transformations. Being robust to such transformations is of primary importance in unconstrained videos, where the action localizations can drastically shift between frames. Our pooling identifies regions of interest using video structural cues estimated by different saliency functions. To combine the different structural information, we introduce an iterative structure learning algorithm, WSVM (weighted SVM), that determines the optimal saliency layout of an action model through a sparse regularizer. A new optimization method is proposed to solve the WSVM's highly non-smooth objective function. We evaluate our approach on standard action datasets (KTH, UCF50 and HMDB). Most noticeably, the accuracy of our algorithm reaches 51.8% on the challenging HMDB dataset, outperforming the state of the art by 7.3% relative.
Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating an unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the untrimmed nature of real video data.
Fast Sparsity-Based Orthogonal Dictionary Learning for Image Restoration [pdf]
Chenglong Bao, Jian-Feng Cai, Hui Ji
Abstract: In recent years, how to learn a dictionary from input images for sparse modelling has been a very active topic in image processing and recognition. Most existing dictionary learning methods consider an over-complete dictionary, e.g. the K-SVD method. Often they require solving some minimization problem that is very challenging in terms of computational feasibility and efficiency. However, if the correlations among dictionary atoms are not well constrained, the redundancy of the dictionary does not necessarily improve the performance of sparse coding. This paper proposes a fast orthogonal dictionary learning method for sparse image representation. With comparable performance on several image restoration tasks, the proposed method is much more computationally efficient than the over-complete dictionary based learning methods.
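The computational appeal of an orthogonal dictionary is that sparse coding reduces to thresholding D^T X, and the dictionary update has a closed form via an SVD (the orthogonal Procrustes solution). Below is a hedged NumPy sketch of this alternation, not the paper's exact algorithm; function and parameter names are chosen for this example.

```python
import numpy as np

def learn_orthogonal_dict(X, sparsity, n_iter=20, seed=0):
    """Alternate hard-thresholded sparse coding with an orthogonal-Procrustes
    dictionary update (a sketch of the orthogonal-dictionary idea).
    X: (d, n) data matrix; sparsity: nonzeros kept per coefficient column."""
    d = X.shape[0]
    rng = np.random.default_rng(seed)
    D, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal init
    C = D.T @ X
    for _ in range(n_iter):
        # Sparse coding is trivial for orthogonal D: threshold D^T X,
        # keeping the `sparsity` largest-magnitude coefficients per column.
        C = D.T @ X
        thresh = -np.sort(-np.abs(C), axis=0)[sparsity - 1]
        C[np.abs(C) < thresh] = 0.0
        # Dictionary update: nearest orthogonal matrix to X C^T.
        U, _, Vt = np.linalg.svd(X @ C.T)
        D = U @ Vt
    return D, C
```

Both subproblems are exact and cheap, which is where the speed advantage over over-complete dictionary learning (e.g. K-SVD) comes from.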
Fast High Dimensional Vector Multiplication Face Recognition [pdf]
Oren Barkan, Jonathan Weill, Lior Wolf, Hagai Aronowitz
Abstract: This paper advances descriptor-based face recognition by suggesting a novel usage of descriptors to form an over-complete representation, and by proposing a new metric learning pipeline within the same/not-same framework. First, the Over-Complete Local Binary Patterns (OCLBP) face representation scheme is introduced as a multi-scale modified version of the Local Binary Patterns (LBP) scheme. Second, we propose an efficient matrix-vector multiplication-based recognition system. The system is based on Linear Discriminant Analysis (LDA) coupled with Within Class Covariance Normalization (WCCN). This is further extended to the unsupervised case by proposing an unsupervised variant of WCCN. Lastly, we introduce Diffusion Maps (DM) for non-linear dimensionality reduction as an alternative to the Whitened Principal Component Analysis (WPCA) method which is often used in face recognition.
We evaluate the proposed framework on the LFW face recognition dataset under the restricted, unrestricted and unsupervised protocols. In all three cases we achieve very competitive results.
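The LBP building block underlying OCLBP is easy to state: each pixel is encoded by thresholding its 8 neighbors against the center value. A minimal single-scale sketch on a list-of-lists image (OCLBP's multi-scale, shifted-block extension is not reproduced here):

```python
def lbp_image(img):
    """Basic 3x3 Local Binary Patterns: each interior pixel becomes an 8-bit
    code from comparing its neighbors against the center value."""
    h, w = len(img), len(img[0])
    # clockwise neighbor offsets starting at the top-left corner
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            c = img[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offs):
                if img[y + dy][x + dx] >= c:
                    code |= 1 << bit
            row.append(code)
        out.append(row)
    return out
```

Histograms of these codes over image blocks form the face descriptor; OCLBP enlarges this representation by computing it at several scales and block shifts.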
Volumetric Semantic Segmentation Using Pyramid Context Features [pdf]
Jonathan T. Barron, Mark D. Biggin, Pablo Arbelaez, David W. Knowles, Soile V.E. Keranen, Jitendra Malik
Abstract: We present an algorithm for the per-voxel semantic segmentation of a three-dimensional volume. At the core of our algorithm is a novel pyramid context feature, a descriptive representation designed such that exact per-voxel linear classification can be made extremely efficient. This feature not only allows for efficient semantic segmentation but enables other aspects of our algorithm, such as novel learned features and a stacked architecture that can reason about self-consistency. We demonstrate our technique on 3D fluorescence microscopy data of Drosophila embryos, for which we are able to produce extremely accurate semantic segmentations in a matter of minutes, and for which other algorithms fail due to the size and high-dimensionality of the data, or due to the difficulty of the task.
A Robust Analytical Solution to Isometric Shape-from-Template with Focal Length Calibration [pdf]
Adrien Bartoli, Daniel Pizarro, Toby Collins
Abstract: We study the uncalibrated isometric Shape-from-Template problem, which consists in estimating an isometric deformation from a template shape to an input image whose focal length is unknown.
Our method is the first that combines the following features: solving for both the 3D deformation and the camera's focal length, involving only local analytical solutions (there is no numerical optimization), being robust to mismatches, handling general surfaces and running extremely fast. This was achieved through two key steps. First, an uncalibrated 3D deformation is computed thanks to a novel piecewise weak-perspective projection model. Second, the camera's focal length is estimated and enables upgrading the 3D deformation to metric. We use a variational framework, implemented using a smooth function basis and sampled local deformation models. The only degeneracy for focal length estimation, which we easily detect, is a flat and fronto-parallel surface.
Experimental results on simulated and real datasets show that our method achieves a 3D shape accuracy slightly below state-of-the-art methods using a precalibrated or the true focal length, and a focal length accuracy slightly below static calibration methods.
Abstract: How do you tell a blackbird from a crow? There has been great progress toward automatic methods for visual recognition, including fine-grained visual categorization, in which the classes to be distinguished are very similar. In a task such as bird species recognition, automatic recognition systems can now exceed the performance of non-experts: most people are challenged to name a couple dozen bird species, let alone identify them. This leads us to the question: can a recognition system show humans what to look for when identifying classes (in this case birds)? In the context of fine-grained visual categorization, we show that we can automatically determine which classes are most visually similar, discover what visual features distinguish very similar classes, and illustrate the key features in a way meaningful to humans. Running these methods on a dataset of bird images, we can generate a visual field guide to birds which includes a tree of similarity that displays the similarity relations between all species, pages for each species showing the most similar other species, and pages for each pair of similar species illustrating their differences.
PhotoOCR: Reading Text in Uncontrolled Conditions [pdf]
Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven
Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.
Finding Actors and Actions in Movies [pdf]
P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic
Abstract: We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty.
Analysis of Scores, Datasets, and Models in Visual Saliency Prediction [pdf]
Ali Borji, Hamed R. Tavakoli, Dicky N. Sihite, Laurent Itti
Abstract: Significant recent progress has been made in developing high-quality saliency models. However, less effort has been undertaken on fair assessment of these models, over large standardized datasets and correctly addressing confounding factors. In this study, we pursue a critical and quantitative look at challenges (e.g., center-bias, map smoothing) in saliency modeling and the way they affect model accuracy. We quantitatively compare 32 state-of-the-art models (using the shuffled AUC score to discount center-bias) on 4 benchmark eye movement datasets, for prediction of human fixation locations and scanpath sequence. We also account for the role of map smoothing. We find that, although model rankings vary, some (e.g., AWS, LG, AIM, and HouNIPS) consistently outperform other models over all datasets. Some models work well for prediction of both fixation locations and scanpath sequence (e.g., Judd, GBVS). Our results show low prediction accuracy for models over emotional stimuli from the NUSEF dataset. Our last benchmark, for the first time, gauges the ability of models to decode the stimulus category from statistics of fixations, saccades, and model saliency values at fixated locations. In this test, the ITTI and AIM models win over other models. Our benchmark provides a comprehensive high-level picture of the strengths and weaknesses of many popular models, and suggests future research directions in saliency modeling.
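The shuffled AUC score used in this benchmark can be sketched as ordinary AUC in which the negatives are saliency values sampled not uniformly, but at fixation locations borrowed from other images, which discounts center bias. A toy version on list-of-lists saliency maps (the names are illustrative):

```python
def auc(pos, neg):
    """Probability that a random positive scores above a random negative
    (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def shuffled_auc(sal, fixations, other_fixations):
    """Shuffled AUC: positives are saliency values at this image's fixation
    locations; negatives are values at fixation locations taken from *other*
    images, so a pure center-prior model scores near chance."""
    pos = [sal[y][x] for (y, x) in fixations]
    neg = [sal[y][x] for (y, x) in other_fixations]
    return auc(pos, neg)
```

Because human fixations in general are center-biased, a model that simply predicts a central blob gets similar positive and negative score distributions and therefore an AUC near 0.5 under this metric.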
Event Recognition in Photo Collections with a Stopwatch HMM [pdf]
Lukas Bossard, Matthieu Guillaumin, Luc Van_Gool
Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes.
Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM).
In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.
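The time-gap-dependent transitions described above can be illustrated with a toy parameterization. The exponential blend and the time constant `tau` below are illustrative assumptions, not the SHMM's exact form: for photos taken seconds apart the chain tends to stay in the current sub-event, while after a long gap it falls back to the base transition matrix.

```python
import numpy as np

def stopwatch_transition(A, dt, tau=60.0):
    """Toy time-gap-dependent transition matrix: blend between
    'stay in the current state' (identity) for a small gap dt and the
    base transition matrix A for large gaps. tau (seconds) controls
    how fast the blend decays; both are illustrative assumptions."""
    w = np.exp(-dt / tau)                       # -> 1 as dt -> 0
    A_dt = w * np.eye(A.shape[0]) + (1.0 - w) * A
    return A_dt / A_dt.sum(axis=1, keepdims=True)  # keep rows stochastic
```

For `dt = 0` this returns the identity (no state change within a burst); for very large `dt` it recovers `A`.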
|
Similar papers:
[rank all papers by similarity to this]
|
|
Estimating the Material Properties of Fabric from Video [pdf]
Katherine L. Bouman, Bei Xiao, Peter Battaglia, William T. Freeman |
|
Abstract: Passively estimating the intrinsic material properties of deformable objects moving in a natural environment is essential for scene understanding. We present a framework to automatically analyze videos of fabrics moving under various unknown wind forces, and recover two key material properties of the fabric: stiffness and area weight. We extend features previously developed to compactly represent static image textures to describe video textures, such as fabric motion. A discriminatively trained regression model is then used to predict the physical properties of fabric from these features. The success of our model is demonstrated on a new, publicly available database of fabric videos with corresponding measured ground truth material properties. We show that our predictions are well correlated with ground truth measurements of stiffness and density for the fabrics. Our contributions include: (a) a database that can be used for training and testing algorithms for passively predicting fabric properties from video, (b) an algorithm for predicting the material properties of fabric from a video, and (c) a perceptual study of humans' ability to estimate the material properties of fabric from videos and images.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Correspondence matching is one of the most common problems in computer vision, and it is often solved using photo-consistency of local regions. These approaches typically assume that the frequency content in the local region is consistent in the image pair, such that matching is performed on similar signals. However, in many practical situations this is not the case; for example, with low depth of field cameras a scene point may be out of focus in one view and in focus in the other, causing a mismatch of frequency signals. Furthermore, this mismatch can vary spatially over the entire image. In this paper we propose a local signal equalization approach for correspondence matching. Using a measure of local image frequency, we equalize local signals using an efficient scale-space image representation such that their frequency contents are optimally suited for matching. Our approach allows better correspondence matching, which we demonstrate with a number of stereo reconstruction examples on synthetic and real datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Bayesian 3D Tracking from Monocular Video [pdf]
Ernesto Brau, Jinyan Guan, Kyle Simek, Luca Del Pero, Colin Reimer Dawson, Kobus Barnard |
|
Abstract: We develop a Bayesian modeling approach for tracking people in 3D from monocular video with unknown cameras. Modeling in 3D provides natural explanations for occlusions and smoothness discontinuities that result from projection, and allows priors on velocity and smoothness to be grounded in physical quantities: meters and seconds vs. pixels and frames. We pose the problem in the context of data association, in which observations are assigned to tracks. A correct application of Bayesian inference to multi-target tracking must address the fact that the model's dimension changes as tracks are added or removed, and thus, posterior densities of different hypotheses are not comparable. We address this by marginalizing out the trajectory parameters so the resulting posterior over data associations has constant dimension. This is made tractable by using (a) Gaussian process priors for smooth trajectories and (b) approximately Gaussian likelihood functions. Our approach provides a principled method for incorporating multiple sources of evidence; we present results using both optical flow and object detector outputs. Results are comparable to recent work on 3D tracking and, unlike others, our method requires no pre-calibrated cameras.
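A Gaussian process smoothness prior of the kind mentioned in (a) can be sketched as follows; the squared-exponential kernel and its parameters are illustrative, chosen only to show what a prior over smooth trajectories in seconds and meters looks like in code.

```python
import numpy as np

def sample_gp_trajectory(times, length_scale=1.0, var=1.0, seed=0):
    """Draw one smooth 1D trajectory from a zero-mean GP with a
    squared-exponential kernel over time (seconds). Illustrative
    parameters; one draw per spatial coordinate gives a 3D track."""
    t = np.asarray(times, dtype=float)[:, None]
    K = var * np.exp(-0.5 * (t - t.T) ** 2 / length_scale ** 2)
    K += 1e-8 * np.eye(len(t))                  # jitter for stability
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(len(t)), K)
```

Because the kernel correlates nearby time points, consecutive samples move smoothly rather than jumping frame to frame.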
|
Similar papers:
[rank all papers by similarity to this]
|
|
A General Dense Image Matching Framework Combining Direct and Feature-Based Costs [pdf]
Jim Braux-Zin, Romain Dupont, Adrien Bartoli |
|
Abstract: Dense motion field estimation (typically optical flow, stereo disparity and surface registration) is a key computer vision problem. Many solutions have been proposed to compute small or large displacements, narrow or wide baseline stereo disparity, but a unified methodology is still lacking. We here introduce a general framework that robustly combines direct and feature-based matching. The feature-based cost is built around a novel robust distance function that handles keypoints and weak features such as segments. It allows us to use putative feature matches which may contain mismatches to guide dense motion estimation out of local minima. Our framework uses a robust direct data term (AD-Census). It is implemented with a powerful second order Total Generalized Variation regularization with external and self-occlusion reasoning. Our framework achieves state of the art performance in several cases (standard optical flow benchmarks, wide-baseline stereo and non-rigid surface registration). Our framework has a modular design that customizes to specific application needs.
Introduction
A dense motion field, also called optical flow, is a very useful cue for problems such as tracking, segmentation, localization and reconstruction, or non-rigid surface registration. Optical flow estimation is an old computer vision problem. While early techniques were patch-based [19], current ones estimate dense flow fields with variational methods built upon the work by Horn and Schunck.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Robust Face Landmark Estimation under Occlusion [pdf]
Xavier P. Burgos-Artizzu, Pietro Perona, Piotr Dollar |
|
Abstract: Human faces captured in real-world conditions present large variations in shape and occlusions due to differences in pose, expression, use of accessories such as sunglasses and hats, and interactions with objects (e.g. food). Current face landmark estimation approaches struggle under such conditions since they fail to provide a principled way of handling outliers. We propose a novel method, called Robust Cascaded Pose Regression (RCPR), which reduces exposure to outliers by detecting occlusions explicitly and using robust shape-indexed features. We show that RCPR improves on previous landmark estimation methods on three popular face datasets (LFPW, LFW and HELEN). We further explore RCPR's performance by introducing a novel face dataset focused on occlusion, composed of 1,007 faces presenting a wide range of occlusion patterns. RCPR reduces failure cases by half on all four datasets, at the same time as it detects face occlusions with an 80/40% precision/recall.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: In this paper, we propose a new family of binary local feature descriptors called nested shape descriptors. These descriptors are constructed by pooling oriented gradients over a large geometric structure called the Hawaiian earring, which is constructed with a nested correlation structure that enables a new robust local distance function called the nesting distance. This distance function is unique to the nested descriptor and provides robustness to outliers from order statistics. In this paper, we define the nested shape descriptor family and introduce a specific member called the seed-of-life descriptor. We perform a trade study to determine optimal descriptor parameters for the task of image matching. Finally, we evaluate performance compared to state-of-the-art local feature descriptors on the VGG-Affine image matching benchmark, showing significant performance gains. Our descriptor is the first binary descriptor to outperform SIFT on this benchmark.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Unifying Nuclear Norm and Bilinear Factorization Approaches for Low-Rank Matrix Decomposition [pdf]
Ricardo Cabral, Fernando De_La_Torre, Joao P. Costeira, Alexandre Bernardino |
|
Abstract: Bilinear factorization:
$\min_{U,V} f(X - UV^\top)$
Nuclear norm regularization:
$\min_{Z} f(X - Z) + \lambda \|Z\|_*$
Variational definition of the nuclear norm:
$\|Z\|_* = \min_{Z = UV^\top} \tfrac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right)$
Unified model:
$\min_{U,V} f(X - UV^\top) + \tfrac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right)$
Low rank models have been widely used for the representation of shape, appearance or motion in computer vision problems. Traditional approaches to fit low rank models make use of an explicit bilinear factorization. These approaches benefit from fast numerical methods for optimization and easy kernelization. However, they suffer from serious local minima problems depending on the loss function and the amount/type of missing data. Recently, these low-rank models have alternatively been formulated as convex problems using the nuclear norm regularizer; unlike factorization methods, their numerical solvers are slow and it is unclear how to kernelize them or to impose a rank a priori.
This paper proposes a unified approach to bilinear factorization and nuclear norm regularization that inherits the benefits of both. We analyze the conditions under which these approaches are equivalent. Moreover, based on this analysis, we propose a new optimization algorithm and a rank continuation strategy that outperform state-of-the-art approaches for Robust PCA, Structure from Motion and Photometric Stereo with outliers and missing data.
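The variational definition of the nuclear norm quoted in this abstract is easy to verify numerically: a balanced factorization built from the SVD attains the nuclear norm. A small sketch, assuming the usual convention $Z = UV^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 4))

nuc = np.linalg.svd(Z, compute_uv=False).sum()   # nuclear norm of Z

# Balanced factorization from the SVD Z = A diag(s) B^T:
# U = A*sqrt(s), V = B*sqrt(s) attains the variational bound.
A, s, Bt = np.linalg.svd(Z, full_matrices=False)
U = A * np.sqrt(s)
V = Bt.T * np.sqrt(s)
assert np.allclose(U @ V.T, Z)                   # still factorizes Z
bound = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)
# bound matches the nuclear norm up to floating point error
```

Each Frobenius term contributes half the singular value sum, which is why the unified model's regularizer acts like the nuclear norm once the factor rank is large enough.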
|
Similar papers:
[rank all papers by similarity to this]
|
|
Heterogeneous Image Features Integration via Multi-modal Semi-supervised Learning Model [pdf]
Xiao Cai, Feiping Nie, Weidong Cai, Heng Huang |
|
Abstract: Automatic image categorization has become increasingly important with the development of the Internet and the growth in the size of image databases. Although image categorization can be formulated as a typical multi-class classification problem, two major challenges have been raised by real-world images. On one hand, though using more labeled training data may improve the prediction performance, obtaining the image labels is a time consuming as well as biased process. On the other hand, more and more visual descriptors have been proposed to describe objects and scenes appearing in images, and different features describe different aspects of the visual characteristics. Therefore, how to integrate heterogeneous visual features to do semi-supervised learning is crucial for categorizing large-scale image data. In this paper, we propose a novel approach to integrate heterogeneous features by performing multi-modal semi-supervised classification on unlabeled as well as unsegmented images. Considering each type of feature as one modality, and taking advantage of the large amount of unlabeled data, our new adaptive multi-modal semi-supervised classification (AMMSS) algorithm learns a commonly shared class indicator matrix and the weights for different modalities (image features) simultaneously.
|
Similar papers:
[rank all papers by similarity to this]
|
|
New Graph Structured Sparsity Model for Multi-label Image Annotations [pdf]
Xiao Cai, Feiping Nie, Weidong Cai, Heng Huang |
|
Abstract: In multi-label image annotations, because each image is associated to multiple categories, the semantic terms (label classes) are not mutually exclusive. Previous research showed that such label correlations can largely boost the annotation accuracy. However, all existing methods only directly apply the label correlation matrix to enhance label inference and assignment without further learning the structural information among classes. In this paper, we model the label correlations using a relational graph, and propose a novel graph structured sparse learning model to incorporate the topological constraints of the relation graph in multi-label classifications. As a result, our new method captures and utilizes the hidden class structures in the relational graph to improve annotation results. In the proposed objective, a large number of structured sparsity-inducing norms are utilized, thus the optimization becomes difficult. To solve this problem, we derive an efficient optimization algorithm with proved convergence. We perform extensive experiments on six multi-label image annotation benchmark data sets. In all empirical results, our new method shows better annotation results than the state-of-the-art approaches.
|
Similar papers:
[rank all papers by similarity to this]
|
|
An Enhanced Structure-from-Motion Paradigm Based on the Absolute Dual Quadric and Images of Circular Points [pdf]
Lilian Calvet, Pierre Gurdjos |
|
Abstract: This work aims at introducing a new unified Structure-from-Motion (SfM) paradigm in which images of circular point-pairs can be combined with images of natural points. An imaged circular point-pair encodes the 2D Euclidean structure of a world plane and can easily be derived from the image of a planar shape, especially those including circles. A classical SfM method generally runs two steps: first, a projective factorization of all matched image points (into projective cameras and points) and, second, a camera self-calibration that upgrades the obtained world from projective to Euclidean. This work shows how to introduce images of circular points in these two SfM steps, while its key contribution is to provide the theoretical foundations for combining classical linear self-calibration constraints with additional ones derived from such images. We show that the two proposed SfM steps clearly contribute to better results than the classical approach. We validate our contributions on synthetic and real images.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Practical Transfer Learning Algorithm for Face Verification [pdf]
Xudong Cao, David Wipf, Fang Wen, Genquan Duan, Jian Sun |
|
Abstract: Face verification involves determining whether a pair of facial images belongs to the same or different subjects. This problem can prove to be quite challenging in many important applications where labeled training data is scarce, e.g., family album photo organization software. Herein we propose a principled transfer learning approach for merging plentiful source-domain data with limited samples from some target domain of interest, to create a classifier that ideally performs nearly as well as if rich target-domain data were present. Based upon a surprisingly simple generative Bayesian model, our approach combines a KL-divergence-based regularizer/prior with a robust likelihood function, leading to a scalable implementation via the EM algorithm. As justification for our design choices, we later use principles from convex analysis to recast our algorithm as an equivalent structured rank minimization problem, leading to a number of interesting insights related to solution structure and feature-transform invariance. These insights help both to explain the effectiveness of our algorithm and to elucidate a wide variety of related Bayesian approaches. Experimental testing with challenging datasets validates the utility of the proposed algorithm.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Recently, a considerable amount of effort has been devoted to the problem of unconstrained face verification, where the task is to predict whether pairs of images are from the same person or not. This problem is challenging due to the large variations in face images. In this paper, we develop a novel regularization framework to learn similarity metrics for unconstrained face verification. We formulate its objective function by incorporating robustness to the large intra-personal variations and the discriminative power of novel similarity metrics. In addition, our formulation is a convex optimization problem, which guarantees the existence of its global solution. Experiments show that our proposed method achieves state-of-the-art results on the challenging Labeled Faces in the Wild (LFW) database [10].
|
Similar papers:
[rank all papers by similarity to this]
|
|
SYM-FISH: A Symmetry-Aware Flip Invariant Sketch Histogram Shape Descriptor [pdf]
Xiaochun Cao, Hua Zhang, Si Liu, Xiaojie Guo, Liang Lin |
|
Abstract: Recently, studies on sketch, such as sketch retrieval and sketch classification, have received more attention in the computer vision community. One of the most fundamental and essential problems is how to describe a sketch image more effectively. Many existing descriptors, such as shape context, have achieved great success. In this paper, we propose a new descriptor, namely the Symmetry-aware Flip Invariant Sketch Histogram (SYM-FISH), to refine the shape context feature. Its extraction process includes three steps. First, the Flip Invariant Sketch Histogram (FISH) descriptor is extracted from the input image, which is a flip-invariant version of the shape context feature. Then we explore the symmetry character of the image by calculating the kurtosis coefficient. Finally, the SYM-FISH is generated by constructing a symmetry table. The new SYM-FISH descriptor supplements the original shape context by encoding symmetric information, which is a pervasive characteristic of natural scenes and objects. We evaluate the efficacy of the novel descriptor in two applications, i.e., sketch retrieval and sketch classification. Extensive experiments on three datasets demonstrate the effectiveness and robustness of the proposed SYM-FISH descriptor.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Symbiotic Segmentation and Part Localization for Fine-Grained Categorization [pdf]
Yuning Chai, Victor Lempitsky, Andrew Zisserman |
|
Abstract: We propose a new method for the task of fine-grained visual categorization. The method builds a model of the base-level category that can be fitted to images, producing high-quality foreground segmentation and mid-level part localizations. The model can be learnt from the typical datasets available for fine-grained categorization, where the only annotation provided is a loose bounding box around the instance (e.g. bird) in each image. Both segmentation and part localizations are then used to encode the image content into a highly-discriminative visual signature.
The model is symbiotic in that part discovery/localization is helped by segmentation and, conversely, the segmentation is helped by the detection (e.g. part layout). Our model builds on top of the part-based object category detector of Felzenszwalb et al., and also on the powerful GrabCut segmentation algorithm of Rother et al., and adds a simple spatial saliency coupling between them. In our evaluation, the model improves the categorization accuracy over the state-of-the-art. It also improves over what can be achieved with an analogous system that runs segmentation and part-localization independently.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Rectangling Stereographic Projection for Wide-Angle Image Visualization [pdf]
Che-Han Chang, Min-Chun Hu, Wen-Huang Cheng, Yung-Yu Chuang |
|
Abstract: This paper proposes a new projection model for mapping a hemisphere to a plane. Such a model can be useful for viewing wide-angle images. Our model consists of two steps. In the first step, the hemisphere is projected onto a swung surface constructed by a circular profile and a rounded rectangular trajectory. The second step maps the projected image on the swung surface onto the image plane through perspective projection. We also propose a method for automatically determining proper parameters for the projection model based on image content. The proposed model has several advantages. It is simple, efficient and easy to control. Most importantly, it makes a better compromise between distortion minimization and line preserving than popular projection models, such as stereographic and Pannini projections. Experiments and analysis demonstrate the effectiveness of our model.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Stacked Predictive Sparse Coding for Classification of Distinct Regions in Tumor Histopathology [pdf]
Hang Chang, Yin Zhou, Paul Spellman, Bahram Parvin |
|
Abstract: Image-based classification of histology sections, in terms of distinct components (e.g., tumor, stroma, normal), provides a series of indices for tumor composition. Furthermore, aggregation of these indices, from each whole slide image (WSI) in a large cohort, can provide predictive models of the clinical outcome. However, the performance of existing techniques is hindered as a result of large technical variations and biological heterogeneities that are always present in a large cohort. We propose a system that automatically learns a series of basis functions for representing the underlying spatial distribution using stacked predictive sparse decomposition (PSD). The learned representation is then fed into the spatial pyramid matching framework (SPM) with a linear SVM classifier. The system has been evaluated for classification of (a) distinct histological components for two cohorts of tumor types, and (b) colony organization of normal and malignant cell lines in 3D cell culture models. Throughput has been increased through the use of graphical processing units (GPUs), and evaluation indicates superior performance compared with previous research.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We present an integrated probabilistic model for layered object tracking that combines dynamics on implicit shape representations, topological shape constraints, adaptive appearance models, and layered flow. The generative model combines the evolution of appearances and layer shapes with a Gaussian process flow and explicit layer ordering. Efficient MCMC sampling algorithms are developed to enable a particle filtering approach while reasoning about the distribution of object boundaries in video. We demonstrate the utility of the proposed tracking algorithm on a wide variety of video sources while achieving state-of-the-art results on a boundary-accurate tracking dataset.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: In this paper we address the problem of robust and efficient averaging of relative 3D rotations. Apart from having an interesting geometric structure, robust rotation averaging addresses the need for a good initialization for the large-scale optimization used in structure-from-motion pipelines. Such pipelines often use unstructured image datasets harvested from the internet, thereby requiring an initialization method that is robust to outliers. Our approach works on the Lie group structure of 3D rotations and solves the problem of large-scale robust rotation averaging in two ways. Firstly, we use modern l1 optimizers to carry out robust averaging of relative rotations that is efficient, scalable and robust to outliers. In addition, we also develop a two-step method that uses the l1 solution as an initialisation for an iteratively reweighted least squares (IRLS) approach. These methods achieve excellent results on large-scale, real world datasets and significantly outperform existing methods, i.e. the state-of-the-art discrete-continuous optimization method of [3] as well as the Weiszfeld method of [8]. We demonstrate the efficacy of our method on two large-scale real world datasets and also provide the results of the two aforementioned methods for comparison.
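As a toy analogue of Weiszfeld-style IRLS on rotations, the sketch below computes a robust mean of a set of absolute rotations under the chordal (Frobenius) metric, reweighting each sample inversely to its distance from the current estimate. The real method averages relative rotations over a view graph; this only illustrates how the reweighting suppresses outliers.

```python
import numpy as np

def project_so3(M):
    """Nearest rotation matrix to M in Frobenius norm (via SVD)."""
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:                    # avoid reflections
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R

def robust_rotation_mean(Rs, iters=30, eps=1e-9):
    """Weiszfeld-style IRLS under the chordal metric: samples far from
    the current estimate get weight 1/distance, so gross outliers are
    down-weighted. Toy single-rotation analogue of rotation averaging."""
    R = project_so3(sum(Rs))                    # L2 chordal mean as init
    for _ in range(iters):
        w = [1.0 / max(np.linalg.norm(Ri - R), eps) for Ri in Rs]
        R = project_so3(sum(wi * Ri for wi, Ri in zip(w, Rs)))
    return R
```

With four rotations clustered near the identity and one gross outlier, the IRLS mean stays near the cluster, whereas a plain average would be pulled toward the outlier.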
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Generalized Low-Rank Appearance Model for Spatio-temporally Correlated Rain Streaks [pdf]
Yi-Lei Chen, Chiou-Ting Hsu |
|
Abstract: In this paper, we propose a novel low-rank appearance model for removing rain streaks. Different from previous work, our method needs neither rain pixel detection nor a time-consuming dictionary learning stage. Instead, as rain streaks usually reveal similar and repeated patterns on the imaged scene, we propose and generalize a low-rank model from matrix to tensor structure in order to capture the spatio-temporally correlated rain streaks. With this appearance model, we remove rain streaks from images/videos (and also other high-order image structures) in a unified way. Our experimental results demonstrate competitive (or even better) visual quality and efficient run-time in comparison with the state of the art.
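The core operation behind a low-rank appearance model, in the plain matrix case, is a truncated-SVD approximation. The paper's contribution is the generalization to tensors and the rain-specific modeling; the sketch below shows only the generic matrix building block.

```python
import numpy as np

def low_rank_approx(M, r):
    """Best rank-r approximation of M in the least-squares sense
    (Eckart-Young), via truncated SVD. When repeated streak patterns
    are stacked as columns, most of M's energy lives in a few
    components, so a small r captures the rain layer."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]
```

If `M` is exactly rank 2, `low_rank_approx(M, 2)` reconstructs it, while `r = 1` keeps only the dominant component.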
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We present a model for intrinsic decomposition of RGB-D images. Our approach analyzes a single RGB-D image and estimates albedo and shading fields that explain the input. To disambiguate the problem, our model estimates a number of components that jointly account for the reconstructed shading. By decomposing the shading field, we can build in assumptions about image formation that help distinguish reflectance variation from shading. These assumptions are expressed as simple nonlocal regularizers. We evaluate the model on real-world images and on a challenging synthetic dataset. The experimental results demonstrate that the presented approach outperforms prior models for intrinsic decomposition of RGB-D images.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Accurate and Robust 3D Facial Capture Using a Single RGBD Camera [pdf]
Yen-Lin Chen, Hsiang-Tao Wu, Fuhao Shi, Xin Tong, Jinxiang Chai |
|
Abstract: This paper presents an automatic and robust approach that accurately captures high-quality 3D facial performances using a single RGBD camera. The key of our approach is to combine the power of automatic facial feature detection and image-based 3D nonrigid registration techniques for 3D facial reconstruction. In particular, we develop a robust and accurate image-based nonrigid registration algorithm that incrementally deforms a 3D template mesh model to best match observed depth image data and important facial features detected from single RGBD images. The whole process is fully automatic and robust because it is based on a single frame facial registration framework. The system is flexible because it does not require any strong 3D facial priors such as blendshape models. We demonstrate the power of our approach by capturing a wide range of 3D facial expressions using a single RGBD camera and achieve state-of-the-art accuracy by comparing against alternative methods.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Constructing Adaptive Complex Cells for Robust Visual Tracking [pdf]
Dapeng Chen, Zejian Yuan, Yang Wu, Geng Zhang, Nanning Zheng |
|
Abstract: Representation is a fundamental problem in object tracking. Conventional methods track the target by describing its local or global appearance. In this paper we show that, besides these two paradigms, the composition of local region histograms can also provide diverse and important object cues. We use cells to extract local appearance, and construct complex cells to integrate the information from cells. With different spatial arrangements of cells, complex cells can explore various contextual information at multiple scales, which is important to improve tracking performance. We also develop a novel template-matching algorithm for object tracking, where the template is composed of temporally varying cells and has two layers to capture the target and background appearance respectively. An adaptive weight is associated with each complex cell to cope with occlusion as well as appearance variation. A fusion weight is associated with each complex cell type to preserve global distinctiveness. Our algorithm is evaluated on 25 challenging sequences, and the results not only confirm the contribution of each component in our tracking system, but also outperform other competing trackers.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Group Norm for Learning Structured SVMs with Unstructured Latent Variables [pdf]
Daozheng Chen, Dhruv Batra, William T. Freeman |
|
Abstract: Latent variable models have been applied to a number of computer vision problems. However, the complexity of the latent space is typically left as a free design choice. A larger latent space results in a more expressive model, but such models are prone to overfitting and are slower to perform inference with. The goal of this paper is to regularize the complexity of the latent space and learn which hidden states are really relevant for prediction. Specifically, we propose using group-sparsity-inducing regularizers such as l1-l2 to estimate the parameters of Structured SVMs with unstructured latent variables. Our experiments on digit recognition and object detection show that our approach is indeed able to control the complexity of the latent space without any significant loss in accuracy of the learnt model.
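The effect of an l1-l2 (group lasso) regularizer is easiest to see through its proximal operator: each group of parameters is shrunk toward zero as a unit, so entire groups, e.g. all weights tied to one hidden state, are switched off at once. A minimal sketch (the grouping and `lam` are illustrative):

```python
import numpy as np

def prox_group_l1l2(w, groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2: block soft
    thresholding. Groups whose l2 norm falls below lam are zeroed
    entirely, which is how irrelevant latent states get pruned."""
    out = w.copy()
    for g in groups:
        nrm = np.linalg.norm(w[g])
        out[g] = 0.0 if nrm <= lam else (1.0 - lam / nrm) * w[g]
    return out
```

A group with a large norm is only mildly shrunk, while a weak group is removed outright, unlike the element-wise l1 prox, which can leave stray nonzero coordinates inside a mostly dead group.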
|
Similar papers:
[rank all papers by similarity to this]
|
|
Learning a Dictionary of Shape Epitomes with Applications to Image Labeling [pdf]
Liang-Chieh Chen, George Papandreou, Alan L. Yuille |
|
Abstract: The first main contribution of this paper is a novel method for representing images based on a dictionary of shape epitomes. These shape epitomes represent the local edge structure of the image and include hidden variables to encode shifts and rotations. They are learnt in an unsupervised manner from ground-truth edges. This dictionary is compact but is also able to capture the typical shapes of edges in natural images. In this paper, we illustrate the shape epitomes by applying them to the image labeling task. In other work, described in the supplementary material, we apply them to edge detection and image modeling.
We apply shape epitomes to image labeling by using Conditional Random Field (CRF) models. They are alternatives to the superpixel or pixel representations used in most CRFs. In our approach, the shape of an image patch is encoded by a shape epitome from the dictionary. Unlike the superpixel representation, our method avoids making early decisions which cannot be reversed. Our resulting hierarchical CRFs efficiently capture both local and global class co-occurrence properties. We demonstrate the quantitative and qualitative properties of our approach with image labeling experiments on two standard datasets: MSRC-21 and Stanford Background.
|
Similar papers:
[rank all papers by similarity to this]
|
|
NEIL: Extracting Visual Knowledge from Web Data [pdf]
Xinlei Chen, Abhinav Shrivastava, Abhinav Gupta |
|
Abstract: We propose NEIL (Never Ending Image Learner), a computer program that runs 24 hours per day and 7 days per week to automatically extract visual knowledge from Internet data. NEIL uses a semi-supervised learning algorithm that jointly discovers common sense relationships (e.g., Corolla is a kind of/looks similar to Car, Wheel is a part of Car) and labels instances of the given visual categories. It is an attempt to develop the world's largest visual structured knowledge base with minimum human labeling effort. As of 10th October 2013, NEIL has been continuously running for 2.5 months on a 200 core cluster (more than 350K CPU hours) and has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. During this period, NEIL has discovered more than 1700 relationships and has labeled more than 400K visual instances.
1. Motivation
Recent successes in computer vision can be primarily attributed to the ever increasing size of visual knowledge in terms of labeled instances of scenes, objects, actions, attributes, and the contextual relationships between them. But as we move forward, a key question arises: how will we gather this structured visual knowledge on a vast scale? Recent efforts such as ImageNet [8] and Visipedia [30] have tried to harness human intelligence for this task. However, we believe that these approaches lack both the richness and the scalability required for gathering massive amounts of visual knowledge. For example, at the
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Sparsity models have recently shown great promise in many vision tasks. Using a learned dictionary in sparsity models can in general outperform predefined bases on clean data. In practice, both training and testing data may be corrupted and contain noises and outliers. Although recent studies attempted to cope with corrupted data and achieved encouraging results in the testing phase, how to handle corruption in the training phase still remains a very difficult problem. In contrast to most existing methods that learn the dictionary from clean data, this paper is targeted at handling corruptions and outliers in training data for dictionary learning. We propose a general method to decompose the reconstructive residual into two components: a non-sparse component for small universal noises and a sparse component for large outliers, respectively. In addition, further analysis reveals the connection between our approach and the partial dictionary learning approach, updating only part of the prototypes (or informative codewords) with the remaining (or noisy codewords) fixed. Experiments on synthetic data as well as real applications have shown satisfactory performance of this new robust dictionary learning approach.
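The residual decomposition described above can be illustrated in isolation: splitting a residual into a dense small-noise term plus a sparse outlier term has a closed-form solution via soft-thresholding. The sketch below is a toy illustration of that split, not the paper's full dictionary-update algorithm; the threshold `lam` is a hypothetical parameter.

```python
import numpy as np

def split_residual(r, lam):
    """Split a reconstruction residual r into a small dense noise
    component e and a sparse outlier component s, by solving
        min_s 0.5 * ||r - s||^2 + lam * ||s||_1,
    whose closed-form solution is soft-thresholding."""
    s = np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)  # sparse outliers
    e = r - s                                          # dense small noise
    return e, s
```

Entries of `r` below the threshold are treated entirely as dense noise; only large entries spill into the sparse outlier component.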
|
Similar papers:
[rank all papers by similarity to this]
|
|
Efficient Salient Region Detection with Soft Image Abstraction [pdf]
Ming-Ming Cheng, Jonathan Warrell, Wen-Yan Lin, Shuai Zheng, Vibhav Vineet, Nigel Crook |
|
Abstract: Detecting visually salient regions in images is one of the fundamental problems in computer vision. We propose a novel method to decompose an image into large scale perceptually homogeneous elements for efficient salient region detection, using a soft image abstraction representation. By considering both appearance similarity and spatial distribution of image pixels, the proposed representation abstracts out unnecessary image details, allowing the assignment of comparable saliency values across similar regions, and producing perceptually accurate salient region detection. We evaluate our salient region detection approach on the largest publicly available dataset with pixel-accurate annotations. The experimental results show that the proposed method outperforms 18 alternate methods, reducing the mean absolute error by 25.2% compared to the previous best result, while being computationally more efficient.
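As a toy illustration of combining appearance similarity with spatial distribution, the sketch below scores abstracted regions by their color contrast to spatially nearby regions. The per-region features, the Gaussian weighting, and the `sigma` value are illustrative assumptions, not the paper's GMM-based soft abstraction.

```python
import numpy as np

def region_saliency(colors, positions, sigma=0.5):
    """Toy global-contrast saliency over abstracted regions: a region
    is salient if its color differs from spatially nearby regions.
    colors: (n, 3) mean colors; positions: (n, 2) normalized centroids."""
    n = len(colors)
    sal = np.zeros(n)
    for i in range(n):
        d_col = np.linalg.norm(colors - colors[i], axis=1)       # appearance contrast
        d_pos = np.linalg.norm(positions - positions[i], axis=1) # spatial distance
        w = np.exp(-d_pos**2 / (2 * sigma**2))                   # nearby regions count more
        sal[i] = np.sum(w * d_col)
    return sal / sal.max()
```

Because similar regions get similar features and weights, they naturally receive comparable saliency values, which is the property the abstraction is designed to produce.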
|
Similar papers:
[rank all papers by similarity to this]
|
|
Rank Minimization across Appearance and Shape for AAM Ensemble Fitting [pdf]
Xin Cheng, Sridha Sridharan, Jason Saragih, Simon Lucey |
|
Abstract: Active Appearance Models (AAMs) employ a paradigm of inverting a synthesis model of how an object can vary in terms of shape and appearance. As a result, the ability of AAMs to register an unseen object image is intrinsically linked to two factors. First, how well the synthesis model can reconstruct the object image. Second, the degrees of freedom in the model. Fewer degrees of freedom yield a higher likelihood of good fitting performance. In this paper we look at how these seemingly contrasting factors can complement one another for the problem of AAM fitting of an ensemble of images stemming from a constrained set (e.g. an ensemble of face images of the same person).
|
Similar papers:
[rank all papers by similarity to this]
|
|
Affine-Constrained Group Sparse Coding and Its Application to Image-Based Classifications [pdf]
Yu-Tseh Chi, Mohsen Ali, Muhammad Rushdi, Jeffrey Ho |
|
Abstract: This paper proposes a novel approach for sparse coding that further improves upon the sparse representation-based classification (SRC) framework. The proposed framework, Affine-Constrained Group Sparse Coding (ACGSC), extends the current SRC framework to classification problems with multiple input samples. Geometrically, the affine-constrained group sparse coding essentially searches for the vector in the convex hull spanned by the input vectors that can best be sparse coded using the given dictionary. The resulting objective function is still convex and can be efficiently optimized using an iterative block-coordinate descent scheme that is guaranteed to converge. Furthermore, we provide a form of sparse recovery result that guarantees, at least theoretically, that the classification performance of the constrained group sparse coding should be at least as good as the group sparse coding. We have evaluated the proposed approach using three different recognition experiments that involve illumination variation of faces and textures, and face recognition under occlusions. Preliminary experiments have demonstrated the effectiveness of the proposed approach, and in particular, the results from the recognition/occlusion experiment are surprisingly accurate and robust.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Multi-attributed Dictionary Learning for Sparse Coding [pdf]
Chen-Kuo Chiang, Te-Feng Su, Chih Yen, Shang-Hong Lai |
|
Abstract: We present a multi-attributed dictionary learning algorithm for sparse coding. Considering training samples with multiple attributes, a new distance matrix is proposed by jointly incorporating data and attribute similarities. Then, an objective function is presented to learn category-dependent dictionaries that are compact (closeness of dictionary atoms based on data distance and attribute similarity), reconstructive (low reconstruction error with correct dictionary) and label-consistent (encouraging the labels of dictionary atoms to be similar). We have demonstrated our algorithm on action classification and face recognition tasks on several publicly available datasets. Experimental results with improved performance over previous dictionary learning methods are shown to validate the effectiveness of the proposed algorithm.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Many tasks in computer vision are formulated as graph matching problems. Despite the NP-hard nature of the problem, fast and accurate approximations have led to significant progress in a wide range of applications. Learning graph models from observed data, however, still remains a challenging issue. This paper presents an effective scheme to parameterize a graph model, and learn its structural attributes for visual object matching. For this, we propose a graph representation with histogram-based attributes, and optimize them to increase the matching accuracy. Experimental evaluations on synthetic and real image datasets demonstrate the effectiveness of our approach, and show significant improvement in matching accuracy over graphs with pre-defined structures.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Modeling the Calibration Pipeline of the Lytro Camera for High Quality Light-Field Image Reconstruction [pdf]
Donghyeon Cho, Minhaeng Lee, Sunyeong Kim, Yu-Wing Tai |
|
Abstract: Light-field imaging systems have received much attention recently as the next generation camera model. A light-field imaging system consists of three parts: data acquisition, manipulation, and application. Given an acquisition system, it is important to understand how a light-field camera converts from its raw image to its resulting refocused image. In this paper, using the Lytro camera as an example, we describe step-by-step procedures to calibrate a raw light-field image. In particular, we are interested in knowing the spatial and angular coordinates of the micro lens array and the resampling process for image reconstruction. Since Lytro uses a hexagonal arrangement of a micro lens image, additional treatments in calibration are required. After calibration, we analyze and compare the performances of several resampling methods for image reconstruction with and without calibration. Finally, a learning based interpolation method is proposed which demonstrates a higher quality image reconstruction than previous interpolation methods including a method used in Lytro software.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Learning-Based Approach to Reduce JPEG Artifacts in Image Matting [pdf]
Inchang Choi, Sunyeong Kim, Michael S. Brown, Yu-Wing Tai |
|
Abstract: Single image matting techniques assume high-quality input images. The vast majority of images on the web and in personal photo collections are encoded using JPEG compression. JPEG images exhibit quantization artifacts that adversely affect the performance of matting algorithms.
To address this situation, we propose a learning-based post-processing method to improve the alpha mattes extracted from JPEG images. Our approach learns a set of sparse dictionaries from training examples that are used to transfer details from high-quality alpha mattes to alpha mattes corrupted by JPEG compression. Three different dictionaries are defined to accommodate different object structure (long hair, short hair, and sharp boundaries). A back-projection criterion combined within an MRF framework is used to automatically select the best dictionary to apply on the object's local boundary. We demonstrate that our method can produce superior results over existing state-of-the-art matting algorithms on a variety of inputs and compression levels.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Segmentation Driven Object Detection with Fisher Vectors [pdf]
Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid |
|
Abstract: We present an object detection system based on the Fisher vector (FV) image representation computed over SIFT and color descriptors. For computational and storage efficiency, we use a recent segmentation-based method to generate class-independent object detection hypotheses, in combination with data compression techniques. Our main contribution is a method to produce tentative object segmentation masks to suppress background clutter in the features. Re-weighting the local image features based on these masks is shown to improve object detection significantly. We also exploit contextual features in the form of a full-image FV descriptor, and an inter-category rescoring mechanism. Our experiments on the PASCAL VOC 2007 and 2010 datasets show that our detector improves over the current state-of-the-art detection results.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Cosegmentation and Cosketch by Unsupervised Learning [pdf]
Jifeng Dai, Ying Nian Wu, Jie Zhou, Song-Chun Zhu |
|
Abstract: Cosegmentation refers to the problem of segmenting multiple images simultaneously by exploiting the similarities between the foreground and background regions in these images. The key issue in cosegmentation is to align common objects between these images. To address this issue, we propose an unsupervised learning framework for cosegmentation, by coupling cosegmentation with what we call cosketch. The goal of cosketch is to automatically discover a codebook of deformable shape templates shared by the input images. These shape templates capture distinct image patterns and each template is matched to similar image patches in different images. Thus the cosketch of the images helps to align foreground objects, thereby providing crucial information for cosegmentation. We present a statistical model whose energy function couples cosketch and cosegmentation. We then present an unsupervised learning algorithm that performs cosketch and cosegmentation by energy minimization. Experiments show that our method outperforms state of the art methods for cosegmentation on the challenging MSRC and iCoseg datasets. We also illustrate our method on a new dataset called Coseg-Rep where cosegmentation can be performed within a single image with repetitive patterns.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: This paper investigates the problem of semi-supervised classification. Unlike previous methods that regularize classifying boundaries with unlabeled data, our method learns a new image representation from all available data (labeled and unlabeled) and performs plain supervised learning with the new feature. In particular, an ensemble of image prototype sets are sampled automatically from the available data, to represent a rich set of visual categories/attributes. Discriminative functions are then learned on these prototype sets, and images are represented by the concatenation of their projected values onto the prototypes (similarities to them) for further classification. Experiments on four standard datasets show three interesting phenomena: (1) our method consistently outperforms previous methods for semi-supervised image classification; (2) our method combines well with these methods; and (3) our method works well for self-taught image classification, where unlabeled data do not come from the same distribution as labeled ones, but rather from a random collection of images.
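A minimal sketch of the representation idea above, assuming raw dot-product similarities in place of the learned discriminative functions (all array sizes are hypothetical): each image is re-represented by its similarities to a sampled prototype set, and several such sets would be concatenated in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))   # features of all available images (labeled + unlabeled)

# sample one prototype set from the data; the paper samples an
# ensemble of such sets and learns discriminative functions on each
proto_idx = rng.choice(100, size=10, replace=False)
prototypes = X[proto_idx]

# new representation: similarity of every image to every prototype
# (a stand-in for projected values onto learned discriminants)
new_feat = X @ prototypes.T
```

A plain supervised classifier would then be trained on `new_feat` restricted to the labeled images.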
|
Similar papers:
[rank all papers by similarity to this]
|
|
Example-Based Facade Texture Synthesis [pdf]
Dengxin Dai, Hayko Riemenschneider, Gerhard Schmitt, Luc Van_Gool |
|
Abstract: There is an increased interest in the efficient creation of city models, be it virtual or as-built. We present a method for synthesizing complex, photo-realistic facade images from a single example. After parsing the example image into its semantic components, a tiling for it is generated. Novel tilings can then be created, yielding facade textures with different dimensions or with occluded parts inpainted. A genetic algorithm guides the novel facades as well as inpainted parts to be consistent with the example, both in terms of their overall structure and their detailed textures. Promising results for multiple standard datasets, in particular for the different building styles they contain, demonstrate the potential of the method.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Photo-sequencing is the problem of recovering the temporal order of a set of still images of a dynamic event, taken asynchronously by a set of uncalibrated cameras. Solving this problem is a first, crucial step for analyzing (or visualizing) the dynamic content of the scene captured by a large number of freely moving spectators. We propose a geometry-based solution, followed by rank aggregation, to the photo-sequencing problem. Our algorithm trades spatial certainty for temporal certainty. Whereas the previous solution proposed by [4] relies on two images taken from the same static camera to eliminate uncertainty in space, we drop the static-camera assumption and replace it with temporal information available from images taken from the same (moving) camera. Our method thus overcomes the limitation of the static-camera assumption, and scales much better with the duration of the event and the spread of cameras in space. We present successful results on challenging real data sets and large scale synthetic data (250 images).
|
Similar papers:
[rank all papers by similarity to this]
|
|
Visual Reranking through Weakly Supervised Multi-graph Learning [pdf]
Cheng Deng, Rongrong Ji, Wei Liu, Dacheng Tao, Xinbo Gao |
|
Abstract: Visual reranking has been widely deployed to refine the quality of conventional content-based image retrieval engines. The current trend lies in employing a crowd of retrieved results stemming from multiple feature modalities to boost the overall performance of visual reranking. However, a major challenge pertaining to current reranking methods is how to take full advantage of the complementary property of distinct feature modalities. Given a query image and one feature modality, a regular visual reranking framework treats the top-ranked images as pseudo-positive instances, which are inevitably noisy; this makes it difficult to exploit the complementary property and leads to inferior ranking performance. This paper proposes a novel image reranking approach by introducing a Co-Regularized Multi-Graph Learning (Co-RMGL) framework, in which the intra-graph and inter-graph constraints are simultaneously imposed to encode affinities in a single graph and consistency across different graphs. Moreover, weakly supervised learning driven by image attributes is performed to denoise the pseudo-labeled instances, thereby highlighting the unique strength of individual feature modalities. Meanwhile, such learning can yield a few anchors in graphs that vitally enable the alignment and fusion of multiple graphs. As a result, an edge weight matrix learned from the fused graph automatically gives the ordering to the initially retrieved results. We evaluate our approach on four benchmark
|
Similar papers:
[rank all papers by similarity to this]
|
|
Detecting Dynamic Objects with Multi-view Background Subtraction [pdf]
Raul Diaz, Sam Hallman, Charless C. Fowlkes |
|
Abstract: The confluence of robust algorithms for structure from motion along with high-coverage mapping and imaging of the world around us suggests that it will soon be feasible to accurately estimate camera pose for a large class of photographs taken in outdoor, urban environments. In this paper, we investigate how such information can be used to improve the detection of dynamic objects such as pedestrians and cars. First, we show that when rough camera location is known, we can utilize detectors that have been trained with a scene-specific background model in order to improve detection accuracy. Second, when precise camera pose is available, dense matching to a database of existing images using multi-view stereo provides a way to eliminate static backgrounds such as building facades, akin to the background subtraction often used in video analysis. We evaluate these ideas using a dataset of tourist photos with estimated camera pose. For template-based pedestrian detection, we achieve a 50 percent boost in average precision over baseline.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Like Father, Like Son: Facial Expression Dynamics for Kinship Verification [pdf]
Hamdi Dibeklioglu, Albert Ali Salah, Theo Gevers |
|
Abstract: Kinship verification from facial appearance is a difficult problem. This paper explores the possibility of employing facial expression dynamics in this problem. By using features that describe facial dynamics and spatio-temporal appearance over smile expressions, we show that it is possible to improve the state of the art in this problem, and verify that it is indeed possible to recognize kinship by resemblance of facial expressions. The proposed method is tested on different kin relationships. On the average, 72.89% verification accuracy is achieved on spontaneous smiles.
|
Similar papers:
[rank all papers by similarity to this]
|
|
The Way They Move: Tracking Multiple Targets with Similar Appearance [pdf]
Caglayan Dicle, Octavia I. Camps, Mario Sznaier |
|
Abstract: We introduce a computationally efficient algorithm for multi-object tracking by detection that addresses four main challenges: appearance similarity among targets, missing data due to targets being out of the field of view or occluded behind other objects, crossing trajectories, and camera motion. The proposed method uses motion dynamics as a cue to distinguish targets with similar appearance, minimize target mis-identification and recover missing data. Computational efficiency is achieved by using a Generalized Linear Assignment (GLA) coupled with efficient procedures to recover missing data and estimate the complexity of the underlying dynamics. The proposed approach works with tracklets of arbitrary length and does not assume a dynamical model a priori, yet it captures the overall motion dynamics of the targets. Experiments using challenging videos show that this framework can handle complex target motions, non-stationary cameras and long occlusions, in scenarios where appearance cues are not available or poor.
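The tracklet-linking step can be pictured as an assignment problem. The brute-force sketch below is illustrative only: real systems use efficient solvers, the paper's GLA additionally allows tracklets to remain unassigned, and the cost values here are made up. It links each tracklet ending to the tracklet start that the motion-dynamics cue deems most compatible.

```python
import numpy as np
from itertools import permutations

def link_tracklets(cost):
    """Brute-force linear assignment for a small square cost matrix:
    cost[i, j] = dissimilarity between the end of tracklet i and the
    start of tracklet j. Returns the minimum-cost pairing (i, j)."""
    n = cost.shape[0]
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(enumerate(best))

# hypothetical dynamics-based linking costs for three tracklets
cost = np.array([[0.2, 0.9, 0.8],
                 [0.7, 0.1, 0.9],
                 [0.9, 0.8, 0.3]])
links = link_tracklets(cost)
```

Because the costs come from motion dynamics rather than appearance, targets that look alike can still be disambiguated by how they move.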
|
Similar papers:
[rank all papers by similarity to this]
|
|
Facial Action Unit Event Detection by Cascade of Tasks [pdf]
Xiaoyu Ding, Wen-Sheng Chu, Fernando De_La_Torre, Jeffery F. Cohn, Qiao Wang |
|
Abstract: Automatic facial Action Unit (AU) detection from video is a long-standing problem in facial expression analysis. AU detection is typically posed as a classification problem between frames or segments of positive examples and negative ones, where existing work emphasizes the use of different features or classifiers. In this paper, we propose a method called Cascade of Tasks (CoT) that combines the use of different tasks (i.e., frame, segment and transition) for AU event detection. We train CoT in a sequential manner embracing diversity, which ensures robustness and generalization to unseen data. In addition to conventional frame-based metrics that evaluate frames independently, we propose a new event-based metric to evaluate detection performance at the event level. We show how the CoT method consistently outperforms state-of-the-art approaches in both frame-based and event-based metrics, across three public datasets that differ in complexity: CK+, FERA and RU-FACS.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification [pdf]
Mandar Dixit, Nikhil Rasiwasia, Nuno Vasconcelos |
|
Abstract: An extension of the latent Dirichlet allocation (LDA), denoted class-specific-simplex LDA (css-LDA), is proposed for image classification. An analysis of the supervised LDA models currently used for this task shows that the impact of class information on the topics discovered by these models is very weak in general. This implies that the discovered topics are driven by general image regularities, rather than the semantic regularities of interest for classification. To address this, we introduce a model that induces supervision in topic discovery, while retaining the original flexibility of LDA to account for unanticipated structures of interest. The proposed css-LDA is an LDA model with class supervision at the level of image features. In css-LDA, topics are discovered per class, i.e. a single set of topics shared across classes is replaced by multiple class-specific topic sets. This model can be used for generative classification using the Bayes decision rule or even extended to discriminative classification with support vector machines (SVMs). A css-LDA model can endow an image with a vector of class- and topic-specific count statistics that are similar to the Bag-of-words (BoW) histogram. SVM-based discriminants can be learned for classes in the space of these histograms. The effectiveness of the css-LDA model in both generative and discriminative classification frameworks is demonstrated through an extensive experimental evaluation, involving multiple benchmark datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Multi-view Object Segmentation in Space and Time [pdf]
Abdelaziz Djelouah, Jean-Sebastien Franco, Edmond Boyer, Francois Le_Clerc, Patrick Perez |
|
Abstract: In this paper, we address the problem of object segmentation in multiple views or videos when two or more viewpoints of the same scene are available. We propose a new approach that propagates segmentation coherence information in both space and time, hence allowing evidence in one image to be shared over the complete set. To this aim the segmentation is cast as a single efficient labeling problem over space and time with graph cuts. In contrast to most existing multi-view segmentation methods that rely on some form of dense reconstruction, ours only requires a sparse 3D sampling to propagate information between viewpoints. The approach is thoroughly evaluated on standard multi-view datasets, as well as on videos. With static views, results compete with state of the art methods but they are achieved with significantly fewer viewpoints. With multiple videos, we report results that demonstrate the benefit of segmentation propagation through temporal cues.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Edge detection is a critical component of many vision systems, including object detectors and image segmentation algorithms. Patches of edges exhibit well-known forms of local structure, such as straight lines or T-junctions. In this paper we take advantage of the structure present in local image patches to learn both an accurate and computationally efficient edge detector. We formulate the problem of predicting local edge masks in a structured learning framework applied to random decision forests. Our novel approach to learning decision trees robustly maps the structured labels to a discrete space on which standard information gain measures may be evaluated. The result is an approach that obtains realtime performance that is orders of magnitude faster than many competing state-of-the-art approaches, while also achieving state-of-the-art edge detection results on the BSDS500 Segmentation dataset and NYU Depth dataset. Finally, we show the potential of our approach as a general purpose edge detector by showing our learned edge models generalize well across datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Deformable Mixture Parsing Model with Parselets [pdf]
Jian Dong, Qiang Chen, Wei Xia, Zhongyang Huang, Shuicheng Yan |
|
Abstract: In this work, we address the problem of human parsing, namely partitioning the human body into semantic regions, by using the novel Parselet representation. Previous works often consider solving the problem of human pose estimation as the prerequisite of human parsing. We argue that these approaches cannot obtain optimal pixel-level parsing due to the inconsistent targets between these tasks. In this paper, we propose to use Parselets as the building blocks of our parsing model. Parselets are a group of parsable segments which can generally be obtained by low-level over-segmentation algorithms and bear strong semantic meaning. We then build a Deformable Mixture Parsing Model (DMPM) for human parsing to simultaneously handle the deformation and multi-modalities of Parselets. The proposed model has two unique characteristics: (1) the possible numerous modalities of Parselet ensembles are exhibited as the And-Or structure of sub-trees; (2) to further solve the practical problem of Parselet occlusion or absence, we directly model the visibility property at some leaf nodes. The DMPM thus directly solves the problem of human parsing by searching for the best graph configuration from a pool of Parselet hypotheses without intermediate tasks. Comprehensive evaluations demonstrate the encouraging performance of the proposed approach.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Stable Hyper-pooling and Query Expansion for Event Detection [pdf]
Matthijs Douze, Jerome Revaud, Cordelia Schmid, Herve Jegou |
|
Abstract: This paper makes two complementary contributions to event retrieval in large collections of videos. First, we propose hyper-pooling strategies that encode the frame descriptors into a representation of the video sequence in a stable manner. Our best choices compare favorably with regular pooling techniques based on k-means quantization. Second, we introduce a technique to improve the ranking. It can be interpreted either as a query expansion method or as a similarity adaptation based on the local context of the query video descriptor. Experiments on public benchmarks show that our methods are complementary and improve event retrieval results, without sacrificing efficiency.
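For the ranking-improvement part, a standard average query expansion step conveys the flavor. This is a generic sketch, not the paper's specific similarity adaptation; descriptors are assumed L2-normalized and `k` is a hypothetical choice.

```python
import numpy as np

def expand_query(q, db, k=2):
    """Average query expansion: retrieve with q, average the top-k
    database descriptors into q, and re-query with the expanded
    vector. q: (d,) query descriptor; db: (n, d) database, all rows
    assumed L2-normalized."""
    sims = db @ q                    # cosine similarities
    topk = np.argsort(-sims)[:k]     # indices of the top-k results
    q_exp = q + db[topk].sum(axis=0) # fold top results into the query
    return q_exp / np.linalg.norm(q_exp)
```

The expanded query pulls in descriptors of near neighbors, so a second retrieval pass tends to rank true matches of the event higher.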
|
Similar papers:
[rank all papers by similarity to this]
|
|
PixelTrack: A Fast Adaptive Algorithm for Tracking Non-rigid Objects [pdf]
Stefan Duffner, Christophe Garcia |
|
Abstract: In this paper, we present a novel algorithm for fast tracking of generic objects in videos. The algorithm uses two components: a detector that makes use of the generalised Hough transform with pixel-based descriptors, and a probabilistic segmentation method based on global models for foreground and background. These components are used for tracking in a combined way, and they adapt each other in a co-training manner. Through effective model adaptation and segmentation, the algorithm is able to track objects that undergo rigid and non-rigid deformations and considerable shape and appearance variations. The proposed tracking method has been thoroughly evaluated on challenging standard videos, and outperforms state-of-the-art tracking methods designed for the same task. Finally, the proposed models allow for an extremely efficient implementation, and thus tracking is very fast.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Accurate Blur Models vs. Image Priors in Single Image Super-resolution [pdf]
Netalee Efrat, Daniel Glasner, Alexander Apartsin, Boaz Nadler, Anat Levin |
|
Abstract: Over the past decade, single image Super-Resolution (SR) research has focused on developing sophisticated image priors, leading to significant advances. Estimating and incorporating the blur model, that relates the high-res and low-res images, has received much less attention, however. In particular, the reconstruction constraint, namely that the blurred and downsampled high-res output should approximately equal the low-res input image, has been either ignored or applied with default fixed blur models. In this work, we examine the relative importance of the image prior and the reconstruction constraint. First, we show that an accurate reconstruction constraint combined with a simple gradient regularization achieves SR results almost as good as those of state-of-the-art algorithms with sophisticated image priors. Second, we study both empirically and theoretically the sensitivity of SR algorithms to the blur model assumed in the reconstruction constraint. We find that an accurate blur model is more important than a sophisticated image prior. Finally, using real camera data, we demonstrate that the default blur models of various SR algorithms may differ from the camera blur, typically leading to over-smoothed results. Our findings highlight the importance of accurately estimating camera blur in reconstructing raw low-res images acquired by an actual camera.
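The reconstruction constraint with gradient regularization can be written as E(x) = ||D(k * x) - y||^2 + lam * ||grad x||^2, where k is the blur kernel and D is downsampling, and minimized by gradient descent. Below is a 1-D toy sketch; the kernel, stride, step size, and regularization weight are illustrative assumptions, and real SR operates on 2-D images with an estimated camera blur.

```python
import numpy as np

def sr_gradient_step(x, y, k, s, lam, step):
    """One gradient-descent step on the SR reconstruction constraint
        E(x) = ||down_s(k * x) - y||^2 + lam * ||grad x||^2
    in 1-D. The blur kernel k is symmetric, so convolution is its own
    adjoint; the adjoint of stride-s downsampling is zero-filling."""
    r = np.convolve(x, k, mode='same')[::s] - y    # data residual
    up = np.zeros_like(x)
    up[::s] = r                                    # adjoint of downsampling
    grad = 2.0 * np.convolve(up, k, mode='same')   # adjoint of blur
    grad += 2.0 * lam * np.convolve(x, [-1.0, 2.0, -1.0], mode='same')  # gradient reg.
    return x - step * grad

# toy example: refine a 2x upsampled signal against a blurred low-res one
y = np.array([1.0, 2.0, 3.0])
k = np.array([0.25, 0.5, 0.25])
x = np.repeat(y, 2)                                # naive initialization
for _ in range(50):
    x = sr_gradient_step(x, y, k, s=2, lam=0.01, step=0.1)
```

Swapping `k` for an inaccurate kernel in the data term, while keeping everything else fixed, is essentially the sensitivity experiment the paper describes.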
|
Similar papers:
[rank all papers by similarity to this]
|
|
Restoring an Image Taken through a Window Covered with Dirt or Rain [pdf]
David Eigen, Dilip Krishnan, Rob Fergus |
|
Abstract: Photographs taken through a window are often compromised by dirt or rain present on the window surface. Common cases of this include pictures taken from inside a vehicle, or outdoor security cameras mounted inside a protective enclosure. At capture time, defocus can be used to remove the artifacts, but this relies on achieving a shallow depth-of-field and placement of the camera close to the window. Instead, we present a post-capture image processing solution that can remove localized rain and dirt artifacts from a single image. We collect a dataset of clean/corrupted image pairs which are then used to train a specialized form of convolutional neural network. This learns how to map corrupted image patches to clean ones, implicitly capturing the characteristic appearance of dirt and water droplets in natural images. Our models demonstrate effective removal of dirt and rain in outdoor test conditions.
|
Similar papers:
[rank all papers by similarity to this]
|
|
On the Mean Curvature Flow on Graphs with Applications in Image and Manifold Processing [pdf]
Abdallah El Chakik, Abderrahim Elmoataz, Ahcene Sadi |
|
Abstract: In this paper, we propose an adaptation and transcription of the mean curvature level set equation on a general discrete domain (weighted graphs with arbitrary topology). We introduce perimeters on graphs using difference operators and define the curvature as the first variation of these perimeters. Our proposed approach to mean curvature unifies both local and nonlocal notions of mean curvature on Euclidean domains. Furthermore, it allows the extension to the processing of manifolds and data which can be represented by graphs.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Convex Optimization Framework for Active Learning [pdf]
Ehsan Elhamifar, Guillermo Sapiro, Allen Yang, S. Shankar Sastry |
|
Abstract: In many image/video/web classification problems, we have access to a large number of unlabeled samples. However, it is typically expensive and time consuming to obtain labels for the samples. Active learning is the problem of progressively selecting and annotating the most informative unlabeled samples, in order to obtain a high classification performance. Most existing active learning algorithms select only one sample at a time prior to retraining the classifier. Hence, they are computationally expensive and cannot take advantage of parallel labeling systems such as Mechanical Turk. On the other hand, algorithms that allow the selection of multiple samples prior to retraining the classifier may select samples that have significant information overlap, or they involve solving a non-convex optimization. More importantly, the majority of active learning algorithms are developed for a certain classifier type such as SVM. In this paper, we develop an efficient active learning framework based on convex programming, which can select multiple samples at a time for annotation. Unlike the state of the art, our algorithm can be used in conjunction with any type of classifier, including those of the family of the recently proposed Sparse Representation-based Classification (SRC). We use the two principles of classifier uncertainty and sample diversity in order to guide the optimization program towards selecting the most informative unlabeled samples […]
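The two principles the abstract names — classifier uncertainty and sample diversity — can be illustrated with a naive greedy baseline. This is not the paper's convex program, just a hypothetical stand-in showing how the two scores trade off in batch selection; all names and the 0.5 weight are my own choices:

```python
import numpy as np

def select_batch(probs, feats, k):
    """Greedy batch selection: prefer samples with high predictive entropy
    (uncertainty) that are far from already-selected samples (diversity)."""
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    chosen = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(feats)):
            if i in chosen:
                continue
            # distance to the closest already-chosen sample (diversity term)
            div = min((np.linalg.norm(feats[i] - feats[j]) for j in chosen),
                      default=1.0)
            score = ent[i] + 0.5 * div
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```

Unlike this greedy loop, the paper's formulation selects the whole batch jointly by solving one convex problem, which is what makes it classifier-agnostic and parallelizable.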
|
Similar papers:
[rank all papers by similarity to this]
|
|
Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions [pdf]
Mohamed Elhoseiny, Babak Saleh, Ahmed Elgammal |
|
Abstract: The main question we address in this paper is how to use purely textual descriptions of categories, with no training images, to learn visual classifiers for these categories. We propose an approach for zero-shot learning of object categories where the description of unseen categories comes in the form of typical text, such as an encyclopedia entry, without the need for explicitly defined attributes. We propose and investigate two baseline formulations, based on regression and domain adaptation. Then, we propose a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the classifier parameters for new classes. We applied the proposed approach on two fine-grained categorization datasets, and the results indicate successful classifier prediction.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: The vast majority of work on motion segmentation adopts the affine camera model due to its simplicity. Under the affine model, the motion segmentation problem becomes that of subspace separation. Due to this assumption, such methods are mainly offline and exhibit poor performance when the assumption is not satisfied. This is made evident in state-of-the-art methods that relax this assumption by using piecewise affine spaces and spectral clustering techniques to achieve better results. In this paper, we formulate the problem of motion segmentation as that of manifold separation. We then show how label propagation can be used in an online framework to achieve manifold separation. The performance of our framework is evaluated on a benchmark dataset and achieves competitive performance while being online.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We propose a fundamentally novel approach to real-time visual odometry for a monocular camera. It allows one to benefit from the simplicity and accuracy of dense tracking, which does not depend on visual features, while running in real-time on a CPU. The key idea is to continuously estimate a semi-dense inverse depth map for the current frame, which in turn is used to track the motion of the camera using dense image alignment. More specifically, we estimate the depth of all pixels which have a non-negligible image gradient. Each estimate is represented as a Gaussian probability distribution over the inverse depth. We propagate this information over time, and update it with new measurements as new images arrive. In terms of tracking accuracy and computational speed, the proposed method compares favorably to both state-of-the-art dense and feature-based visual odometry and SLAM algorithms. As our method runs in real-time on a CPU, it is of large practical value for robotics and augmented reality applications.
1. Towards Dense Monocular Visual Odometry
Tracking a hand-held camera and recovering the three-dimensional structure of the environment in real-time is among the most prominent challenges in computer vision. In recent years, dense approaches to these challenges have become increasingly popular: instead of operating solely on visual feature positions, they reconstruct and track on the whole image using a surface-based map, and thereby are fundamentally different […]
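The per-pixel update the abstract describes — a Gaussian over inverse depth refined with each new measurement — reduces to the standard product-of-Gaussians filtering step. A minimal sketch under that assumption (the function name and numbers are mine, not the authors'):

```python
def fuse_inverse_depth(mu_prior, var_prior, mu_obs, var_obs):
    """Fuse a per-pixel inverse-depth prior with a new stereo measurement.
    Both are Gaussians; the fused variance is always smaller than either."""
    var = 1.0 / (1.0 / var_prior + 1.0 / var_obs)
    mu = var * (mu_prior / var_prior + mu_obs / var_obs)
    return mu, var
```

Repeating this as new frames arrive is what lets the uncertainty of each semi-dense depth estimate shrink over time.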
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We extend patch-based methods to work on patches in 3D space. We start with Coherency Sensitive Hashing [12] (CSH), which is an algorithm for matching patches between two RGB images, and extend it to work with RGBD images. This is done by warping all 3D patches to a common virtual plane in which CSH is performed. To avoid noise due to warping of patches of various normals and depths, we estimate a group of dominant planes and compute CSH on each plane separately, before merging the matching patches. The result is DCSH - an algorithm that matches world (3D) patches in order to guide the search for image plane matches. An independent contribution is an extension of CSH, which we term Social-CSH. It allows a major speedup of the k nearest neighbor (kNN) version of CSH - its runtime growing linearly, rather than quadratically, in k. Social-CSH is used as a subcomponent of DCSH when many NNs are required, as in the case of image denoising. We show the benefits of using depth information for image reconstruction and image denoising, demonstrated on several RGBD images.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Given a set of images which share an object from the same semantic category, we would like to co-segment the shared object. We define good co-segments to be ones which can be easily composed (like a puzzle) from large pieces of other co-segments, yet are difficult to compose from remaining image parts. These pieces must not only match well but also be statistically significant (hard to compose at random). This gives rise to co-segmentation of objects in very challenging scenarios with large variations in appearance, shape and large amounts of clutter. We further show how multiple images can collaborate and score each other's co-segments to improve the overall fidelity and accuracy of the co-segmentation. Our co-segmentation can be applied both to large image collections, as well as to very few images (where there is too little data for unsupervised learning). At the extreme, it can be applied even to a single image, to extract its co-occurring objects. Our approach obtains state-of-the-art results on benchmark datasets. We further show very encouraging co-segmentation results on the challenging PASCAL-VOC dataset.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Relative Attributes for Large-Scale Abandoned Object Detection [pdf]
Quanfu Fan, Prasad Gabbur, Sharath Pankanti |
|
Abstract: Effective reduction of false alarms in large-scale video surveillance is rather challenging, especially for applications where abnormal events of interest rarely occur, such as abandoned object detection. We develop an approach to prioritize alerts by ranking them, and demonstrate its great effectiveness in reducing false positives while keeping good detection accuracy. Our approach benefits from a novel representation of abandoned object alerts by relative attributes, namely staticness, foregroundness and abandonment. The relative strengths of these attributes are quantified using a ranking function [19] learnt on suitably designed low-level spatial and temporal features. These attributes of varying strengths are not only powerful in distinguishing abandoned objects from false alarms such as people and light artifacts, but also computationally efficient for large-scale deployment. With these features, we apply a linear ranking algorithm to sort alerts according to their relevance to the end-user. We test the effectiveness of our approach on both public datasets and large ones collected from the real world.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: This paper proposes to learn binary hash codes within a statistical learning framework, in which an upper bound on the probability of Bayes decision errors is derived for different forms of hash functions and a rigorous proof of the convergence of the upper bound is presented. Consequently, minimizing such an upper bound leads to consistent performance improvements of existing hash code learning algorithms, regardless of whether the original algorithms are unsupervised or supervised. This paper also illustrates a fast hash coding method that exploits simple binary tests to achieve orders of magnitude improvement in coding speed as compared to projection-based methods.
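The speed argument in the last sentence is easy to see in code: a bit produced by a binary test is a single comparison of two feature coordinates, whereas a projection-based bit costs a full dot product. A hypothetical sketch (the test pairs here are arbitrary, not the paper's learned ones):

```python
def binary_test_hash(x, tests):
    """Each output bit compares two coordinates of the feature vector --
    no projections, so coding cost is O(bits) instead of O(bits * dim)."""
    return [1 if x[i] > x[j] else 0 for (i, j) in tests]
```

In practice the pairs `(i, j)` would be chosen during training; here they simply illustrate the form of the hash function.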
|
Similar papers:
[rank all papers by similarity to this]
|
|
Unbiased Metric Learning: On the Utilization of Multiple Datasets and Web Images for Softening Bias [pdf]
Chen Fang, Ye Xu, Daniel N. Rockmore |
|
Abstract: Many standard computer vision datasets exhibit biases due to a variety of sources, including illumination condition, imaging system, and preference of dataset collectors. Biases like these can have downstream effects in the use of vision datasets in the construction of generalizable techniques, especially for the goal of creating a classification system capable of generalizing to unseen and novel datasets. In this work we propose Unbiased Metric Learning (UML), a metric learning approach, to achieve this goal. UML operates in the following two steps: (1) By varying hyperparameters, it learns a set of less biased candidate distance metrics on training examples from multiple biased datasets. The key idea is to learn a neighborhood for each example, which consists of not only examples of the same category from the same dataset, but also those from other datasets. The learning framework is based on structural SVM. (2) We do model validation on a set of weakly-labeled web images retrieved by issuing class labels as keywords to a search engine. The metric with the best validation performance is selected. Although the web images sometimes have noisy labels, they often tend to be less biased, which makes them suitable for the validation set in our task. Cross-dataset image classification experiments are carried out. Results show significant performance improvement on four well-known computer vision datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Large-Scale Image Annotation by Efficient and Robust Kernel Metric Learning [pdf]
Zheyun Feng, Rong Jin, Anil Jain |
|
Abstract: One of the key challenges in search-based image annotation models is to define an appropriate similarity measure between images. Many kernel distance metric learning (KML) algorithms have been developed in order to capture the nonlinear relationships between visual features and semantics of the images. One fundamental limitation in applying KML to image annotation is that it requires converting image annotations into binary constraints, leading to a significant information loss. In addition, most KML algorithms suffer from high computational cost due to the requirement that the learned matrix has to be positive semi-definite (PSD). In this paper, we propose a robust kernel metric learning (RKML) algorithm based on the regression technique that is able to directly utilize image annotations. The proposed method is also computationally more efficient because the PSD property is automatically ensured by regression. We provide theoretical guarantees for the proposed algorithm, and verify its efficiency and effectiveness for image annotation by comparing it to state-of-the-art approaches for both distance metric learning and image annotation.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Super-resolution via Transform-Invariant Group-Sparse Regularization [pdf]
Carlos Fernandez-Granda, Emmanuel J. Candes |
|
Abstract: We present a framework to super-resolve planar regions found in urban scenes and other man-made environments by taking into account their 3D geometry. Such regions have highly structured straight edges, but this prior is challenging to exploit due to deformations induced by the projection onto the imaging plane. Our method factors out such deformations by using recently developed tools based on convex optimization to learn a transform that maps the image to a domain where its gradient has a simple group-sparse structure. This allows us to obtain a novel convex regularizer that enforces global consistency constraints between the edges of the image. Computational experiments with real images show that this data-driven approach to the design of regularizers promoting transform-invariant group sparsity is very effective at high super-resolution factors. We view our approach as complementary to most recent super-resolution methods, which tend to focus on hallucinating high-frequency textures.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Mining Multiple Queries for Image Retrieval: On-the-Fly Learning of an Object-Specific Mid-level Representation [pdf]
Basura Fernando, Tinne Tuytelaars |
|
Abstract: In this paper we present a new method for object retrieval starting from multiple query images. The use of multiple queries allows for a more expressive formulation of the query object, including, e.g., different viewpoints and/or viewing conditions. This, in turn, leads to more diverse and more accurate retrieval results. When no query images are available to the user, they can easily be retrieved from the internet using a standard image search engine.
In particular, we propose a new method based on pattern mining. Using the minimal description length principle, we derive the most suitable set of patterns to describe the query object, with patterns corresponding to local feature configurations. This results in a powerful object-specific mid-level image representation.
The archive can then be searched efficiently for similar images based on this representation, using a combination of two inverted file systems. Since the patterns already encode local spatial information, good results on several standard image retrieval datasets are obtained even without costly re-ranking based on geometric verification.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Unsupervised Visual Domain Adaptation Using Subspace Alignment [pdf]
Basura Fernando, Amaury Habrard, Marc Sebban, Tinne Tuytelaars |
|
Abstract: In this paper, we introduce a new domain adaptation (DA) algorithm where the source and target domains are represented by subspaces described by eigenvectors. In this context, our method seeks a domain adaptation solution by learning a mapping function which aligns the source subspace with the target one. We show that the solution of the corresponding optimization problem can be obtained in a simple closed form, leading to an extremely fast algorithm. We use a theoretical result to tune the unique hyperparameter corresponding to the size of the subspaces. We run our method on various datasets and show that, despite its intrinsic simplicity, it outperforms state-of-the-art DA methods.
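The closed form mentioned in this abstract is short enough to sketch. Under the usual reading of subspace alignment — PCA bases Xs and Xt for the two domains and an alignment matrix M = Xs^T Xt — a minimal version looks like the following (the helper names and the centering details are my own):

```python
import numpy as np

def subspace_alignment(src, tgt, d):
    """Align the source PCA subspace with the target one and return the
    adapted d-dimensional coordinates for both domains."""
    def pca_basis(X, d):
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[:d].T                 # columns = top-d eigenvectors

    Xs, Xt = pca_basis(src, d), pca_basis(tgt, d)
    M = Xs.T @ Xt                       # closed-form alignment matrix
    return src @ (Xs @ M), tgt @ Xt     # aligned source, projected target
```

Because the solution is a single matrix product, the only quantity left to choose is the subspace size `d` — the unique hyperparameter the abstract refers to.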
|
Similar papers:
[rank all papers by similarity to this]
|
|
Image Guided Depth Upsampling Using Anisotropic Total Generalized Variation [pdf]
David Ferstl, Christian Reinbacher, Rene Ranftl, Matthias Ruether, Horst Bischof |
|
Abstract: In this work we present a novel method for the challenging problem of depth image upsampling. Modern depth cameras such as Kinect or Time-of-Flight cameras deliver dense, high quality depth measurements but are limited in their lateral resolution. To overcome this limitation we formulate a convex optimization problem using higher order regularization for depth image upsampling. In this optimization an anisotropic diffusion tensor, calculated from a high resolution intensity image, is used to guide the upsampling. We derive a numerical algorithm based on a primal-dual formulation that is efficiently parallelized and runs at multiple frames per second. We show that this novel upsampling clearly outperforms state of the art approaches in terms of speed and accuracy on the widely used Middlebury 2007 datasets. Furthermore, we introduce novel datasets with highly accurate ground truth, which, for the first time, enable benchmarking of depth upsampling methods using real sensor data.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Corrected-Moment Illuminant Estimation
|
Abstract: Image colors are biased by the color of the prevailing illumination. As such, the color at a pixel cannot always be used directly in solving vision tasks, from recognition to tracking to general scene understanding. Illuminant estimation algorithms attempt to infer the color of the light incident in a scene, and then a color cast removal step discounts the color bias due to illumination. However, despite sustained research since almost the inception of computer vision, progress has been modest. The best algorithms now, often built on top of expensive feature extraction and machine learning, are only about twice as good as the simplest approaches.
This paper, in effect, will show how simple moment-based algorithms such as GrayWorld can, with the addition of a simple correction step, deliver much improved illuminant estimation performance. The corrected GrayWorld algorithm maps the mean image color using a fixed per-camera matrix transform. More generally, our moment approach employs first-, second- and higher-order moments of colors or features (such as color derivatives), and these again are linearly corrected to give an illuminant estimate. The question of how to correct the moments is an important one, yet we will show a simple alternating least-squares training procedure suffices. Remarkably, across the major datasets […]
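The corrected GrayWorld step described in this entry amounts to a learned linear map applied to the image's mean color. A minimal sketch, with plain least squares standing in for the paper's alternating least-squares training (which is needed when the ground-truth illuminant is known only up to scale); all names here are mine:

```python
import numpy as np

def corrected_grayworld(images, ground_truth):
    """Learn a fixed per-camera correction matrix C by least squares so
    that C applied to the mean RGB approximates the true illuminant,
    then return an estimator for new images."""
    means = np.array([img.reshape(-1, 3).mean(axis=0) for img in images])
    # Solve means @ C_T ~= ground_truth in the least-squares sense.
    C_T, *_ = np.linalg.lstsq(means, ground_truth, rcond=None)
    return lambda img: img.reshape(-1, 3).mean(axis=0) @ C_T
```

Plain GrayWorld is the special case where the correction matrix is the identity; the training step is what closes most of the gap to heavier learning-based estimators.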
|
Similar papers:
[rank all papers by similarity to this]
|
|
Structured Learning of Sum-of-Submodular Higher Order Energy Functions [pdf]
Alexander Fix, Thorsten Joachims, Sam Park, Ramin Zabih |
|
Abstract: Submodular functions can be exactly minimized in polynomial time, and the special case that graph cuts solve with max flow [19] has had significant impact in computer vision [5, 21, 28]. In this paper we address the important class of sum-of-submodular (SoS) functions [2, 18], which can be efficiently minimized via a variant of max flow called submodular flow [6]. SoS functions can naturally express higher order priors involving, e.g., local image patches; however, it is difficult to fully exploit their expressive power because they have so many parameters. Rather than trying to formulate existing higher order priors as an SoS function, we take a discriminative learning approach, effectively searching the space of SoS functions for a higher order prior that performs well on our training set. We adopt a structural SVM approach [15, 34] and formulate the training problem in terms of quadratic programming; as a result we can efficiently search the space of SoS priors via an extended cutting-plane algorithm. We also show how the state-of-the-art max flow method for vision problems [11] can be modified to efficiently solve the submodular flow problem. Experimental comparisons are made against the OpenCV implementation of the GrabCut interactive segmentation technique [28], which uses hand-tuned parameters instead of machine learning. On a standard dataset [12] our method learns higher order priors with hundreds of parameter values, and produces significantly better […]
|
Similar papers:
[rank all papers by similarity to this]
|
|
Data-Driven 3D Primitives for Single Image Understanding [pdf]
David F. Fouhey, Abhinav Gupta, Martial Hebert |
|
Abstract: What primitives should we use to infer the rich 3D world behind an image? We argue that these primitives should be both visually discriminative and geometrically informative, and we present a technique for discovering such primitives. We demonstrate the utility of our primitives by using them to infer 3D surface normals given a single image. Our technique substantially outperforms the state-of-the-art and shows improved cross-dataset performance.
|
Similar papers:
[rank all papers by similarity to this]
|
|
EVSAC: Accelerating Hypotheses Generation by Modeling Matching Scores with Extreme Value Theory [pdf]
Victor Fragoso, Pradeep Sen, Sergio Rodriguez, Matthew Turk |
|
Abstract: Algorithms based on RANSAC that estimate models using feature correspondences between images can slow down tremendously when the percentage of correct correspondences (inliers) is small. In this paper, we present a probabilistic parametric model that allows us to assign confidence values to each matching correspondence and therefore accelerates the generation of hypothesis models for RANSAC under these conditions. Our framework leverages Extreme Value Theory to accurately model the statistics of matching scores produced by a nearest-neighbor feature matcher. Using a new algorithm based on this model, we are able to estimate accurate hypotheses with RANSAC at low inlier ratios significantly faster than previous state-of-the-art approaches, while still performing comparably when the number of inliers is large. We present results of homography and fundamental matrix estimation experiments for both SIFT and SURF matches that demonstrate that our method leads to accurate and fast model estimations.
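Once per-correspondence confidences exist (however they are obtained — the paper derives them from an extreme-value fit to matching scores), the acceleration step is conceptually simple: bias RANSAC's minimal-set sampling toward high-confidence matches. A hedged sketch of that sampling step only, with hypothetical names:

```python
import numpy as np

def sample_hypothesis_sets(confidences, set_size, n_sets, seed=0):
    """Draw RANSAC minimal sample sets with probability proportional to
    each correspondence's confidence, so probable inliers are tried first."""
    rng = np.random.default_rng(seed)
    p = np.asarray(confidences, dtype=float)
    p = p / p.sum()
    return [rng.choice(len(p), size=set_size, replace=False, p=p)
            for _ in range(n_sets)]
```

At low inlier ratios this weighted sampling needs far fewer iterations than uniform sampling to hit an all-inlier set, which is the source of the speedup the abstract reports.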
|
Similar papers:
[rank all papers by similarity to this]
|
|
Separating Reflective and Fluorescent Components Using High Frequency Illumination in the Spectral Domain [pdf]
Ying Fu, Antony Lam, Imari Sato, Takahiro Okabe, Yoichi Sato |
|
Abstract: Hyperspectral imaging is beneficial to many applications, but current methods do not consider fluorescent effects, which are present in everyday items ranging from paper, to clothing, to even our food. Furthermore, everyday fluorescent items exhibit a mix of reflectance and fluorescence, so proper separation of these components is necessary for analyzing them. In this paper, we demonstrate efficient separation and recovery of reflective and fluorescent emission spectra through the use of high frequency illumination in the spectral domain. With the obtained fluorescent emission spectra from our high frequency illuminants, we then present, to our knowledge, the first method for estimating the fluorescent absorption spectrum of a material given its emission spectrum. Conventional bispectral measurement of absorption and emission spectra needs to examine all combinations of incident and observed light wavelengths. In contrast, our method requires only two hyperspectral images. The effectiveness of our proposed methods is then evaluated through a combination of simulation and real experiments. We also demonstrate an application of our method to synthetic relighting of real scenes.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Unified Video Segmentation Benchmark: Annotation, Metrics and Analysis [pdf]
Fabio Galasso, Naveen Shankar Nagaraja, Tatiana Jimenez Cardenas, Thomas Brox, Bernt Schiele |
|
Abstract: Video segmentation research is currently limited by the lack of a benchmark dataset that covers the large variety of subproblems appearing in video segmentation and that is large enough to avoid overfitting. Consequently, there is little analysis of video segmentation which generalizes across subtasks, and it is not yet clear which and how video segmentation should leverage the information from the still frames, as previously studied in image segmentation, alongside video-specific information, such as temporal volume, motion and occlusion. In this work we provide such an analysis based on annotations of a large video dataset, where each video is manually segmented by multiple persons. Moreover, we introduce a new volume-based metric that includes the important aspect of temporal consistency, that can deal with segmentation hierarchies, and that reflects the tradeoff between over-segmentation and segmentation accuracy.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Modern descriptors like HOG and SIFT are now commonly used in vision for pattern detection within image and video. From a signal processing perspective, this detection process can be efficiently posed as a correlation/convolution between a multi-channel image and a multi-channel detector/filter, which results in a single-channel response map indicating where the pattern (e.g. object) has occurred. In this paper, we propose a novel framework for learning a multi-channel detector/filter efficiently in the frequency domain, both in terms of training time and memory footprint, which we refer to as a multi-channel correlation filter. To demonstrate the effectiveness of our strategy, we evaluate it across a number of visual detection/localization tasks where we: (i) exhibit superior performance to current state-of-the-art correlation filters, and (ii) superior computational and memory efficiencies compared to state-of-the-art spatial detectors.
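The frequency-domain trick underlying correlation filters is easy to show in the single-channel case: the filter has a closed form per frequency bin, and correlation becomes pointwise multiplication. This MOSSE-style sketch is a simplified stand-in for the paper's multi-channel formulation; the names and the regularizer value are mine:

```python
import numpy as np

def train_filter(x, g, lam=1e-2):
    """Closed-form single-channel correlation filter in the Fourier domain:
    returns the (conjugate) filter that maps image x to desired response g."""
    X, G = np.fft.fft2(x), np.fft.fft2(g)
    return G * np.conj(X) / (X * np.conj(X) + lam)

def correlate(h_conj, x):
    """Apply the filter: correlation = pointwise product in frequency."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * h_conj))
```

The multi-channel case in the paper couples the channels at each frequency bin, but keeps this same per-bin closed-form structure — which is where the training-time and memory savings come from.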
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We aim to decompose a global histogram representation of an image into histograms of its associated objects and regions. This task is formulated as an optimization problem, given a set of linear classifiers which can effectively discriminate the object categories present in the image. Our decomposition bypasses harder problems associated with accurately localizing and segmenting objects. We evaluate our method on a wide variety of composite histograms, and also compare it with MRF-based solutions. In addition to merely measuring the accuracy of decomposition, we also show the utility of the estimated object and background histograms for the task of image classification on the PASCAL VOC 2007 dataset.
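A toy version of the decomposition problem makes the setup concrete: recover nonnegative mixing weights of known component histograms from the global histogram. This projected-gradient least squares is a crude stand-in for the paper's classifier-driven optimization, under my own assumptions about the objective:

```python
import numpy as np

def decompose_histogram(global_hist, component_hists, iters=500, lr=0.1):
    """Find nonnegative weights w with component_hists @ w ~= global_hist
    via projected gradient descent on the squared error."""
    g = np.asarray(global_hist, dtype=float)
    A = np.asarray(component_hists, dtype=float).T   # bins x components
    w = np.full(A.shape[1], 1.0 / A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ w - g)
        w = np.clip(w - lr * grad, 0.0, None)        # project onto w >= 0
    return w
```

The paper replaces the known-component assumption with linear classifier scores, so the decomposition works without localizing or segmenting the objects first.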
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Color Constancy Model with Double-Opponency Mechanisms [pdf]
Shaobing Gao, Kaifu Yang, Chaoyi Li, Yongjie Li |
|
Abstract: The double-opponent color-sensitive cells in the primary visual cortex (V1) of the human visual system (HVS) have long been recognized as the physiological basis of color constancy. We introduce a new color constancy model by imitating the functional properties of the HVS from the retina to the double-opponent cells in V1. The idea behind the model originates from the observation that the color distribution of the responses of double-opponent cells to the input color-biased images coincides well with the light source direction. Then the true illuminant color of a scene is easily estimated by searching for the maxima of the separate RGB channels of the responses of double-opponent cells in the RGB space. Our systematic experimental evaluations on two commonly used image datasets show that the proposed model can produce competitive results in comparison to the complex state-of-the-art approaches, but with a simple implementation and without the need for training.
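The final read-out step this abstract describes — per-channel maxima of a filtered RGB response map, taken as the illuminant direction — is a one-liner. The double-opponent filtering that produces the response map is omitted here; this sketch only shows the maxima search, with my own function name:

```python
import numpy as np

def maxima_illuminant_estimate(response):
    """Take the per-channel maxima of an (already filtered) H x W x 3
    response map and normalize to a unit-length illuminant direction."""
    e = response.reshape(-1, 3).max(axis=0)
    return e / np.linalg.norm(e)
```

Applied directly to raw pixels instead of double-opponent responses, the same read-out reduces to the classic max-RGB estimator, which is why the model stays simple and training-free.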
|
Similar papers:
[rank all papers by similarity to this]
|
|
Discriminant Tracking Using Tensor Representation with Semi-supervised Improvement [pdf]
Jin Gao, Junliang Xing, Weiming Hu, Steve Maybank |
|
Abstract: Visual tracking has witnessed growing methods in object representation, which is crucial to robust tracking. The dominant mechanism in object representation is using image features encoded in a vector as observations to perform tracking, without considering that an image is intrinsically a matrix, or a 2nd-order tensor. Approaches following this mechanism thus inevitably lose a lot of useful information, and therefore cannot fully exploit the spatial correlations within the 2D image ensembles. In this paper, we address an image as a 2nd-order tensor in its original form, and find a discriminative linear embedding space approximation to the original nonlinear submanifold embedded in the tensor space, based on the graph embedding framework. We specially design two graphs for characterizing the intrinsic local geometrical structure of the tensor space, so as to retain more discriminant information when reducing the dimension along certain tensor dimensions. However, spatial correlations within a tensor are not limited to the elements along these dimensions. This means that some part of the discriminant information may not be encoded in the embedding space. We introduce a novel technique called semi-supervised improvement to iteratively adjust the embedding space to compensate for the loss of discriminant information, hence improving the performance of our tracker. Experimental results on challenging videos demonstrate the effectiveness and robustness of the proposed tracker.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Fine-Grained Categorization by Alignments [pdf]
E. Gavves, B. Fernando, C.G.M. Snoek, A.W.M. Smeulders, T. Tuytelaars |
|
Abstract: The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching-oriented features like HOG. We evaluate the method on the CUB-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art.
|
Similar papers:
[rank all papers by similarity to this]
|
|
SIFTpack: A Compact Representation for Efficient SIFT Matching [pdf]
Alexandra Gilinsky, Lihi Zelnik Manor |
|
Abstract: Computing distances between large sets of SIFT descriptors is a basic step in numerous algorithms in computer vision. When the number of descriptors is large, as is often the case, computing these distances can be extremely time consuming. In this paper we propose the SIFTpack: a compact way of storing SIFT descriptors, which enables significantly faster calculations between sets of SIFTs than the current solutions. SIFTpack can be used to represent SIFTs densely extracted from a single image or sparsely from multiple different images. We show that the SIFTpack representation saves both storage space and run time, for both finding nearest neighbors and for computing all distances between all descriptors. The usefulness of SIFTpack is also demonstrated as an alternative implementation for K-means dictionaries of visual words.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: In this paper, we show how to train a deformable part model (DPM) fast, typically in less than 20 minutes (four times faster than the current fastest method), while maintaining high average precision on the PASCAL VOC datasets. At the core of our approach is latent LDA, a novel generalization of linear discriminant analysis for learning latent variable models. Unlike latent SVM, latent LDA uses efficient closed-form updates and does not require an expensive search for hard negative examples. Our approach also acts as a springboard for a detailed experimental study of DPM training. We isolate and quantify the impact of key training factors for the first time (e.g., How important are discriminative SVM filters? How important is joint parameter estimation? How many negative images are needed for training?). Our findings yield useful insights for researchers working with Markov random fields and part-based models, and have practical implications for speeding up tasks such as model selection.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Hidden Factor Analysis for Age Invariant Face Recognition [pdf]
Dihong Gong, Zhifeng Li, Dahua Lin, Jianzhuang Liu, Xiaoou Tang |
|
Abstract: Age invariant face recognition has received increasing attention due to its great potential in real world applications. In spite of the great progress in face recognition techniques, reliably recognizing faces across ages remains a difficult task. The facial appearance of a person changes substantially over time, resulting in significant intra-class variations. Hence, the key to tackle this problem is to separate the variation caused by aging from the person-specific features that are stable. Specifically, we propose a new method, called Hidden Factor Analysis (HFA). This method captures the intuition above through a probabilistic model with two latent factors: an identity factor that is age-invariant and an age factor affected by the aging process. Then, the observed appearance can be modeled as a combination of the components generated based on these factors. We also develop a learning algorithm that jointly estimates the latent factors and the model parameters using an EM procedure. Extensive experiments on two well-known public domain face aging datasets, MORPH (the largest public face aging database) and FGNET, clearly show that the proposed method achieves notable improvement over state-of-the-art algorithms.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: The problem of minimizing the Potts energy function frequently occurs in computer vision applications. One way to tackle this NP-hard problem was proposed by Kovtun [20, 21]. It identifies a part of an optimal solution by running k maxflow computations, where k is the number of labels. The number of labeled pixels can be significant in some applications, e.g. 50-93% in our tests for stereo. We show how to reduce the runtime to O(log k) maxflow computations (or one parametric maxflow computation). Furthermore, the output of our algorithm allows speeding up the subsequent alpha expansion for the unlabeled part, or can be used as is for time-critical applications.
To derive our technique, we generalize the algorithm of Felzenszwalb et al. [7] for Tree Metrics. We also show a connection to k-submodular functions from combinatorial optimization, and discuss k-submodular relaxations for general energy functions.
|
Similar papers:
[rank all papers by similarity to this]
|
|
YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition [pdf]
Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko |
|
Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities in-the-wild. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to fill in novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Digital images nowadays show large appearance variabilities in picture styles, in terms of color tone, contrast, vignetting, etc. These picture styles are directly related to the scene radiance, the image pipeline of the camera, and post-processing functions (e.g., photography effect filters). Due to the complexity and nonlinearity of these factors, popular gradient-based image descriptors generally are not invariant to different picture styles, which could degrade the performance of object recognition. Given that images shared online or created by individual users are taken with a wide range of devices and may be processed by various post-processing functions, finding a robust object recognition system is both useful and challenging. In this paper, we investigate the influence of picture styles on object recognition by making a connection between image descriptors and a pixel mapping function g, and accordingly propose an adaptive approach based on a g-incorporated kernel descriptor and multiple kernel learning, without estimating or specifying the image styles used in training and testing. We conduct experiments on the Domain Adaptation data set, the Oxford Flower data set, and several variants of the Flower data set created by introducing popular photography effects through post-processing. The results demonstrate that the proposed method consistently yields recognition improvements over standard descriptors in all studied cases.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. Given an RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. To achieve this, we created a labeling tool and annotated 1449 images with rich, complete 3D scene models in the NYU dataset. We extract ground truth from the annotated dataset and developed a pipeline for predicting floor space, walls, and the height and full extent of support surfaces. Finally, we match the predicted extent with annotated scenes in training scenes and transfer the support surface configuration from training scenes. We evaluate the proposed approach on our dataset and demonstrate its effectiveness in understanding scenes in 3D space.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Video Co-segmentation for Meaningful Action Extraction [pdf]
Jiaming Guo, Zhuwen Li, Loong-Fah Cheong, Steven Zhiying Zhou |
|
Abstract: Given a pair of videos having a common action, our goal is to simultaneously segment this pair of videos to extract this common action. As a preprocessing step, we first remove background trajectories by a motion-based figure-ground segmentation. To remove the remaining background and those extraneous actions, we propose the trajectory co-saliency measure, which captures the notion that trajectories recurring in all the videos should have their mutual saliency boosted. This requires a trajectory matching process which can compare trajectories with different lengths and not necessarily spatiotemporally aligned, and yet be discriminative enough despite significant intra-class variation in the common action. We further leverage graph matching to enforce geometric coherence between regions so as to reduce feature ambiguity and matching errors. Finally, to classify the trajectories into common action and action outliers, we formulate the problem as a binary labeling of a Markov Random Field, in which the data term is measured by the trajectory co-saliency and the smoothness term is measured by the spatiotemporal consistency between trajectories. To evaluate the performance of our framework, we introduce a dataset containing clips that have animal actions as well as human actions. Experimental results show that the proposed method performs well in common action extraction.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Fibonacci Exposure Bracketing for High Dynamic Range Imaging [pdf]
Mohit Gupta, Daisuke Iso, Shree K. Nayar |
|
Abstract: Exposure bracketing for high dynamic range (HDR) imaging involves capturing several images of the scene at different exposures. If either the camera or the scene moves during capture, the captured images must be registered. Large exposure differences between bracketed images lead to inaccurate registration, resulting in artifacts such as ghosting (multiple copies of scene objects) and blur. We present two techniques, one for image capture (Fibonacci exposure bracketing) and one for image registration (generalized registration), to prevent such motion-related artifacts. Fibonacci bracketing involves capturing a sequence of images such that each exposure time is the sum of the previous N (N > 1) exposures. Generalized registration involves estimating motion between sums of contiguous sets of frames, instead of between individual frames. Together, the two techniques ensure that motion is always estimated between frames of the same total exposure time. This results in HDR images and videos which have both a large dynamic range and minimal motion-related artifacts. We show, by results for several real-world indoor and outdoor scenes, that the proposed approach significantly outperforms several existing bracketing schemes.
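The capture rule is simple to state in code. A minimal sketch follows (the helper name and the choice of seeding the sequence with N equal base exposures are assumptions for illustration, not from the paper):

```python
def fibonacci_bracket(base, n_frames, N=2):
    """Exposure sequence in which each new exposure time equals the sum
    of the previous N exposures; N=2 yields the Fibonacci numbers."""
    seq = [base] * N  # seed with N equal base exposures (an assumption)
    while len(seq) < n_frames:
        seq.append(sum(seq[-N:]))
    return seq

seq = fibonacci_bracket(1, 8)  # [1, 1, 2, 3, 5, 8, 13, 21]

# The property generalized registration relies on: each frame has the
# same total exposure as its previous N frames combined, so motion can
# always be estimated between equal total exposure times.
for k in range(2, len(seq)):
    assert seq[k] == sum(seq[k - 2:k])
```

With N = 2, registering frame k against the sum of frames k-2 and k-1 compares two images of identical total exposure, sidestepping the large brightness gaps that break ordinary bracketed registration.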
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: In this paper, we investigate the properties of the Lp norm (p ≤ 1) within a projection framework. We start with the KKT equations of the non-linear optimization problem and then use its key properties to arrive at an algorithm for Lp norm projection on the non-negative simplex. We compare with L1 projection, which needs prior knowledge of the true norm, as well as hard-thresholding based sparsification proposed in recent compressed sensing literature. We show performance improvements compared to these techniques across different vision applications.
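For comparison, the classical Euclidean (L2) projection onto the probability simplex can be written in a few lines with the standard sorting-based algorithm. The paper's contribution is the harder Lp case, which this sketch does not implement:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {x : x >= 0, sum(x) = 1}, via the standard sorting algorithm.
    Shown only as a reference point for the simpler L2 case."""
    n = len(v)
    u = np.sort(v)[::-1]           # sort entries in decreasing order
    css = np.cumsum(u)
    # largest index rho with u[rho] * (rho+1) > css[rho] - 1
    rho = np.nonzero(u * np.arange(1, n + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)  # shared shift from the KKT conditions
    return np.maximum(v - theta, 0.0)

x = project_simplex(np.array([0.5, 1.2, -0.3]))  # -> [0.15, 0.85, 0.0]
```

The shift theta plays the role of the Lagrange multiplier for the sum constraint; negative coordinates are clipped to zero, which is where sparsity enters even in the L2 case.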
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Strong ambient illumination severely degrades the performance of structured light based techniques. This is especially true in outdoor scenarios, where the structured light sources have to compete with sunlight, whose power is often 2-5 orders of magnitude larger than the projected light. In this paper, we propose the concept of light-concentration to overcome strong ambient illumination. Our key observation is that given a fixed light (power) budget, it is always better to allocate it sequentially in several portions of the scene, as compared to spreading it over the entire scene at once. For a desired level of accuracy, we show that by distributing light appropriately, the proposed approach requires 1-2 orders lower acquisition time than existing approaches. Our approach is illumination-adaptive as the optimal light distribution is determined based on a measurement of the ambient illumination level. Since current light sources have a fixed light distribution, we have built a prototype light source that supports flexible light distribution by controlling the scanning speed of a laser scanner. We show several high quality 3D scanning results in a wide range of outdoor scenarios. The proposed approach will benefit 3D vision systems that need to operate outdoors under extreme ambient illumination levels on a limited time and power budget.
|
Similar papers:
[rank all papers by similarity to this]
|
|
The Interestingness of Images [pdf]
Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, Luc Van_Gool |
|
Abstract: We investigate human interest in photos. Based on our own and others' psychological experiments, we identify various cues for interestingness, namely aesthetics, unusualness and general preferences. For the ranking of retrieved images, interestingness is more appropriate than cues proposed earlier. Interestingness is, for example, correlated with what people believe they will remember. This is opposed to actual memorability, which is uncorrelated to both of them. We introduce a set of features computationally capturing the three main aspects of visual interestingness that we propose, and build an interestingness predictor from them. Its performance is shown on three datasets with varying context, reflecting diverse levels of prior knowledge of the viewers.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: This paper presents a new method for deblurring photos using a sharp reference example that contains some shared content with the blurry photo. Most previous deblurring methods that exploit information from other photos require an accurately registered photo of the same static scene. In contrast, our method aims to exploit reference images where the shared content may have undergone substantial photometric and non-rigid geometric transformations, as these are the kind of reference images most likely to be found in personal photo albums.
Our approach builds upon a recent method for example-based deblurring using non-rigid dense correspondence (NRDC) [11] and extends it in two ways. First, we suggest exploiting information from the reference image not only for blur kernel estimation, but also as a powerful local prior for the non-blind deconvolution step. Second, we introduce a simple yet robust technique for spatially varying blur estimation, rather than assuming spatially uniform blur. Unlike the above previous method, which has proven successful only with simple deblurring scenarios, we demonstrate that our method succeeds on a variety of real-world examples. We provide quantitative and qualitative evaluation of our method and show that it outperforms the state-of-the-art.
|
Similar papers:
[rank all papers by similarity to this]
|
|
High Quality Shape from a Single RGB-D Image under Uncalibrated Natural Illumination [pdf]
Yudeog Han, Joon-Young Lee, In So Kweon |
|
Abstract: We present a novel framework to estimate the detailed shape of diffuse objects with uniform albedo from a single RGB-D image. To estimate accurate lighting in a natural illumination environment, we introduce a general lighting model consisting of two components: a global model and a local model. The global lighting model is estimated from the RGB-D input using the low-dimensional characteristic of a diffuse reflectance model. The local lighting model represents spatially varying illumination and is estimated by using the smoothly-varying characteristic of illumination. With both the global and local lighting models, we can accurately estimate complex lighting variations in uncontrolled natural illumination conditions. For high quality shape capture, a shape-from-shading approach is applied with the estimated lighting model. Since the entire process is done with a single RGB-D input, our method is capable of capturing the high quality shape details of a dynamic object under natural illumination. Experimental results demonstrate the feasibility and effectiveness of our method, which dramatically improves the shape details of the rough depth input.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Dictionary Learning and Sparse Coding on Grassmann Manifolds: An Extrinsic Solution [pdf]
Mehrtash Harandi, Conrad Sanderson, Chunhua Shen, Brian Lovell |
|
Abstract: Recent advances in computer vision and machine learning suggest that a wide range of problems can be addressed more appropriately by considering non-Euclidean geometry. In this paper we explore sparse dictionary learning over the space of linear subspaces, which form Riemannian structures known as Grassmann manifolds. To this end, we propose to embed Grassmann manifolds into the space of symmetric matrices by an isometric mapping, which enables us to devise a closed-form solution for updating a Grassmann dictionary, atom by atom. Furthermore, to handle non-linearity in data, we propose a kernelised version of the dictionary learning algorithm. Experiments on several classification tasks (face recognition, action recognition, dynamic texture classification) show that the proposed approach achieves considerable improvements in discrimination accuracy, in comparison to state-of-the-art methods such as the kernelised Affine Hull Method and graph-embedding Grassmann discriminant analysis.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We present a data-driven method for estimating the 3D shapes of faces viewed in single, unconstrained photos (aka in-the-wild). Our method was designed with an emphasis on robustness and efficiency, with the explicit goal of deployment in real-world applications which reconstruct and display faces in 3D. Our key observation is that for many practical applications, warping the shape of a reference face to match the appearance of a query is enough to produce realistic impressions of the query's 3D shape. Doing so, however, requires matching visual features between the (possibly very different) query and reference images, while ensuring that a plausible face shape is produced. To this end, we describe an optimization process which seeks to maximize the similarity of appearances and depths, jointly, to those of a reference model. We describe our system for monocular face shape reconstruction and present both qualitative and quantitative experiments, comparing our method against alternative systems and demonstrating its capabilities. Finally, as a testament to its suitability for real-world applications, we offer an open, online implementation of our system, providing unique means of instant 3D viewing of faces appearing in web photos.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We present an image editing tool called Content-Aware Rotation. Casually shot photos can appear tilted, and are often corrected by rotation and cropping. This trivial solution may remove desired content and hurt image integrity. Instead of doing rigid rotation, we propose a warping method that creates the perception of rotation and avoids cropping. Human vision studies suggest that the perception of rotation is mainly due to horizontal/vertical lines. We design an optimization-based method that preserves the rotation of horizontal/vertical lines, maintains the completeness of the image content, and reduces the warping distortion. An efficient algorithm is developed to address the challenging optimization. We demonstrate our content-aware rotation method on a variety of practical cases.
|
Similar papers:
[rank all papers by similarity to this]
|
|
PM-Huber: PatchMatch with Huber Regularization for Stereo Matching [pdf]
Philipp Heise, Sebastian Klose, Brian Jensen, Alois Knoll |
|
Abstract: Most stereo correspondence algorithms match support windows at integer-valued disparities and assume a constant disparity value within the support window. The recently proposed PatchMatch stereo algorithm [7] overcomes this limitation of previous algorithms by directly estimating planes. This work presents a method that integrates the PatchMatch stereo algorithm into a variational smoothing formulation using quadratic relaxation. The resulting algorithm allows the explicit regularization of the disparity and normal gradients using the estimated plane parameters. Evaluation of our method on the Middlebury benchmark shows that our method outperforms the traditional integer-valued disparity strategy as well as the original algorithm and its variants in sub-pixel accurate disparity estimation.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Real-Time Body Tracking with One Depth Camera and Inertial Sensors [pdf]
Thomas Helten, Meinard Muller, Hans-Peter Seidel, Christian Theobalt |
|
Abstract: In recent years, the availability of inexpensive depth cameras, such as the Microsoft Kinect, has boosted the research in monocular full body skeletal pose tracking. Unfortunately, existing trackers often fail to capture poses where a single camera provides insufficient data, such as non-frontal poses, and all other poses with body part occlusions. In this paper, we present a novel sensor fusion approach for real-time full body tracking that succeeds in such difficult situations. It takes inspiration from previous tracking solutions, and combines a generative tracker and a discriminative tracker retrieving closest poses in a database. In contrast to previous work, both trackers employ data from a low number of inexpensive body-worn inertial sensors. These sensors provide reliable and complementary information when the monocular depth information alone is not sufficient. We also contribute new algorithmic solutions to best fuse depth and inertial data in both trackers. One is a new visibility model to determine global body pose, occlusions and usable depth correspondences and to decide what data modality to use for discriminative tracking. We also contribute a new inertial-based pose retrieval, and an adapted late fusion step to calculate the final body pose.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition [pdf]
Joao F. Henriques, Joao Carreira, Rui Caseiro, Jorge Batista |
|
Abstract: Competitive sliding window detectors require vast training sets. Since a pool of natural images provides a nearly endless supply of negative samples, in the form of patches at different scales and locations, training with all the available data is considered impractical. A staple of current approaches is hard negative mining, a method of selecting relevant samples, which is nevertheless expensive. Given that samples at slightly different locations have overlapping support, there seems to be an enormous amount of duplicated work. It is natural, then, to ask whether these redundancies can be eliminated.
In this paper, we show that the Gram matrix describing such data is block-circulant. We derive a transformation based on the Fourier transform that block-diagonalizes the Gram matrix, at once eliminating redundancies and partitioning the learning problem. This decomposition is valid for any dense features and several learning algorithms, and takes full advantage of modern parallel architectures. Surprisingly, it allows training with all the potential samples in sets of thousands of images. By considering the full set, we generate in a single shot the optimal solution, which is usually obtained only after several rounds of hard negative mining. We report speed gains on Caltech Pedestrians and INRIA Pedestrians of over an order of magnitude, allowing training on a desktop computer in a couple of minutes.
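The algebraic fact behind this decomposition, that the discrete Fourier transform diagonalizes a circulant matrix, can be checked on a toy example. This NumPy sketch shows only the 1D single-block case, not the paper's full block-circulant machinery:

```python
import numpy as np

def circulant(c):
    """Circulant matrix whose first column is c; column k is c rolled by k."""
    return np.stack([np.roll(c, k) for k in range(len(c))], axis=1)

c = np.array([4.0, 1.0, 0.0, 1.0])  # illustrative first column
C = circulant(c)

# Eigenvalues of a circulant matrix are the DFT of its first column,
# so F C F^{-1} is diagonal: in the block-circulant case the learning
# problem decouples into an independent subproblem per frequency.
F = np.fft.fft(np.eye(len(c)), axis=0)  # DFT matrix
D = F @ C @ np.linalg.inv(F)            # numerically diagonal
eig = np.fft.fft(c)                     # its diagonal entries
```

The same one-shot diagonalization is what lets every translated negative patch be accounted for at once instead of being mined round by round.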
|
Similar papers:
[rank all papers by similarity to this]
|
|
Orderless Tracking through Model-Averaged Posterior Estimation [pdf]
Seunghoon Hong, Suha Kwak, Bohyung Han |
|
Abstract: We propose a novel offline tracking algorithm based on model-averaged posterior estimation through patch matching across frames. Contrary to existing online and offline tracking methods, our algorithm is not based on temporally-ordered estimates of target state but attempts to select easy-to-track frames first out of the remaining ones, without exploiting the temporal coherency of the target. The posterior of the selected frame is estimated by propagating densities from the already tracked frames in a recursive manner. The density propagation across frames is implemented by an efficient patch matching technique, which is useful for our algorithm since it does not require a motion smoothness assumption. Also, we present a hierarchical approach, where a small set of key frames are tracked first and non-key frames are handled by local key frames. Our tracking algorithm is conceptually well-suited for sequences with abrupt motion, shot changes, and occlusion. We compare our tracking algorithm with existing techniques on real videos with such challenges and illustrate its superior performance qualitatively and quantitatively.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Tracking via Robust Multi-task Multi-view Joint Sparse Representation [pdf]
Zhibin Hong, Xue Mei, Danil Prokhorov, Dacheng Tao |
|
Abstract: Combining multiple observation views has proven beneficial for tracking. In this paper, we cast tracking as a novel multi-task multi-view sparse learning problem and exploit the cues from multiple views including various types of visual features, such as intensity, color, and edge, where each feature observation can be sparsely represented by a linear combination of atoms from an adaptive feature dictionary. The proposed method is integrated in a particle filter framework where every view in each particle is regarded as an individual task. We jointly consider the underlying relationship between tasks across different views and different particles, and tackle it in a unified robust multi-task formulation. In addition, to capture the frequently emerging outlier tasks, we decompose the representation matrix into two collaborative components which enable a more robust and accurate approximation. We show that the proposed formulation can be efficiently solved using the Accelerated Proximal Gradient method with a small number of closed-form updates. The presented tracker is implemented using four types of features and is tested on numerous benchmark video sequences. Both the qualitative and quantitative results demonstrate the superior performance of the proposed approach compared to several state-of-the-art trackers.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Multi-scale Topological Features for Hand Posture Representation and Analysis [pdf]
Kaoning Hu, Lijun Yin |
|
Abstract: In this paper, we propose a multi-scale topological feature representation for the automatic analysis of hand posture. Such topological features have the advantage of being posture-dependent while being preserved under certain variations of illumination, rotation, personal dependency, etc. Our method studies the topology of the holes between the hand region and its convex hull. Inspired by the principle of Persistent Homology, the theory of computational topology for topological feature analysis over multiple scales, we construct the multi-scale Betti Numbers matrix (MSBNM) for the topological feature representation. In our experiments, we used 12 different hand postures and compared our features with three popular features (HOG, MCT, and Shape Context) on different data sets. In addition to hand postures, we also extend the feature representation to arm postures. The results demonstrate the feasibility and reliability of the proposed method.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Recognising Human-Object Interaction via Exemplar Based Modelling [pdf]
Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, Shaogang Gong, Tao Xiang |
|
Abstract: Human action can be recognised from a single still image by modelling human-object interaction (HOI), which infers the mutual spatial structure information between human and object as well as their appearance. Existing approaches rely heavily on accurate detection of the human and object, and estimation of the human pose. They are thus sensitive to large variations of human poses, occlusion and unsatisfactory detection of small objects. To overcome this limitation, a novel exemplar based approach is proposed in this work. Our approach learns a set of spatial pose-object interaction exemplars, which are density functions describing, in a probabilistic way, how a person is spatially interacting with a manipulated object for different activities. A representation based on our HOI exemplars thus has great potential for being robust to errors in human/object detection and pose estimation. A new framework, consisting of a proposed exemplar based HOI descriptor and an activity specific matching model that learns the parameters, is formulated for robust human activity recognition. Experiments on two benchmark activity datasets demonstrate that the proposed approach obtains state-of-the-art performance.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Collaborative Active Learning of a Kernel Machine Ensemble for Recognition [pdf]
Gang Hua, Chengjiang Long, Ming Yang, Yan Gao |
|
Abstract: Active learning is an effective way of engaging users to interactively train models for visual recognition. The vast majority of previous works, if not all of them, focused on active learning with a single human oracle. The problem of active learning with multiple oracles in a collaborative setting has not been well explored. Moreover, most of the previous works assume that the labels provided by the human oracles are noise free, which may often be violated in reality. We present a collaborative computational model for active learning with multiple human oracles. It leads to not only an ensemble kernel machine that is robust to label noise, but also a principled label quality measure to detect irresponsible labelers online. Instead of running independent active learning processes for each individual human oracle, our model captures the inherent correlations among the labelers through shared data among them. Our simulation experiments and experiments with real crowd-sourced noisy labels demonstrated the efficacy of our model.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Coupled Dictionary and Feature Space Learning with Applications to Cross-Domain Image Synthesis and Recognition [pdf]
De-An Huang, Yu-Chiang Frank Wang |
|
Abstract: Cross-domain image synthesis and recognition are typically considered as two distinct tasks in the areas of computer vision and pattern recognition. Therefore, it is not clear whether approaches addressing one task can be easily generalized or extended for solving the other. In this paper, we propose a unified model for coupled dictionary and feature space learning. The proposed learning model not only observes a common feature space for associating cross-domain image data for recognition purposes; the derived feature space is also able to jointly update the dictionaries in each image domain for improved representation. This is why our method can be applied to both cross-domain image synthesis and recognition problems. Experiments on a variety of synthesis and recognition tasks such as single image super-resolution, cross-view action recognition, and sketch-to-photo face recognition verify the effectiveness of our proposed learning model.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Coupling Alignments with Recognition for Still-to-Video Face Recognition [pdf]
Zhiwu Huang, Xiaowei Zhao, Shiguang Shan, Ruiping Wang, Xilin Chen |
|
Abstract: Still-to-Video (S2V) face recognition systems typically need to match faces in low-quality videos captured under unconstrained conditions against high-quality still face images, which is very challenging because of noise, image blur, low face resolution, varying head pose, complex lighting, and alignment difficulty. To address the problem, one solution is to select the frames of best quality from the videos (hereinafter called quality alignment in this paper). Meanwhile, the faces in the selected frames should also be geometrically aligned to the well-aligned still faces in the gallery. In this paper, we discover that the interactions among the three tasks (quality alignment, geometric alignment and face recognition) can benefit each other, and thus the tasks should be performed jointly. With this in mind, we propose a Coupling Alignments with Recognition (CAR) method to tightly couple these tasks via low-rank regularized sparse representation in a unified framework. Our method makes the three tasks promote each other through a joint optimization in an Augmented Lagrange Multiplier routine. Extensive experiments on two challenging S2V datasets demonstrate that our method outperforms the state-of-the-art methods impressively.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors [pdf]
Weilin Huang, Zhe Lin, Jianchao Yang, Jue Wang |
|
Abstract: In this paper, we present a new approach for text localization in natural images, by discriminating text and non-text regions at three levels: pixel, component and text-line levels. Firstly, a powerful low-level filter called the Stroke Feature Transform (SFT) is proposed, which extends the widely-used Stroke Width Transform (SWT) by incorporating color cues of text pixels, leading to significantly enhanced performance on inter-component separation and intra-component connection. Secondly, based on the output of SFT, we apply two classifiers, a text component classifier and a text-line classifier, sequentially to extract text regions, eliminating the heuristic procedures that are commonly used in previous approaches. The two classifiers are built upon two novel Text Covariance Descriptors (TCDs) that encode both the heuristic properties and the statistical characteristics of text strokes. Finally, text regions are located by simply thresholding the text-line confidence map. Our method was evaluated on two benchmark datasets: ICDAR 2005 and ICDAR 2011, and the corresponding F-measure values are 0.72 and 0.73, respectively, surpassing previous methods in accuracy by a large margin.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Optimal Orthogonal Basis and Image Assimilation: Motion Modeling [pdf]
Etienne Huot, Giuseppe Papari, Isabelle Herlin |
|
Abstract: This paper describes the modeling and numerical computation of orthogonal bases, which are used to describe images and motion fields. Motion estimation from image data is then studied on subspaces spanned by these bases. A reduced model is obtained as the Galerkin projection on these subspaces of a physical model, based on Euler and optical flow equations. A data assimilation method is studied, which assimilates coefficients of image data in the reduced model in order to estimate motion coefficients. The approach is first quantified on synthetic data: it demonstrates the interest of model reduction as a compromise between result quality and computational cost. Results obtained on real data are then displayed so as to illustrate the method.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Markov Network-Based Unified Classifier for Face Identification [pdf]
Wonjun Hwang, Kyungshik Roh, Junmo Kim |
|
Abstract: We propose a novel unifying framework using a Markov network to learn the relationship between multiple classifiers in face recognition. We assume that we have several complementary classifiers and assign observation nodes to the features of a query image and hidden nodes to the features of gallery images. We connect each hidden node to its corresponding observation node and to the hidden nodes of other neighboring classifiers. For each observation-hidden node pair, we collect a set of gallery candidates that are most similar to the observation instance, and the relationship between the hidden nodes is captured in terms of the similarity matrix between the collected gallery images. Posterior probabilities in the hidden nodes are computed by the belief-propagation algorithm. The novelty of the proposed framework is the method that takes into account the classifier dependency using the results of each neighboring classifier. We present extensive results on two different evaluation protocols, known and unknown image variation tests, using three different databases, which show that the proposed framework always leads to good accuracy in face recognition.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Assigning a visual code to a low-level image descriptor, which we call code assignment, is the most computationally expensive part of image classification algorithms based on the bag of visual words (BoW) framework. This paper proposes a fast computation method, Neighbor-to-Neighbor (NTN) search, for this code assignment. Based on the fact that image features from an adjacent region are usually similar to each other, this algorithm effectively reduces the cost of calculating the distance between a codeword and a feature vector. This method can be applied not only to a hard codebook constructed by vector quantization (NTN-VQ), but also to a soft codebook, a Gaussian mixture model (NTN-GMM). We evaluated this method on the PASCAL VOC 2007 classification challenge task. NTN-VQ reduced the assignment cost by 77.4% in super-vector coding, and NTN-GMM reduced it by 89.3% in Fisher-vector coding, without any significant degradation in classification performance.
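To make the cost structure concrete, here is a minimal sketch of the baseline that NTN search accelerates: brute-force hard assignment of each descriptor to its nearest codeword, followed by bag-of-visual-words histogram pooling. The function names and toy data are illustrative, not the paper's implementation.

```python
import numpy as np

def assign_codes(descriptors, codebook):
    """Hard-assign each local descriptor to its nearest codeword
    (the brute-force step whose cost NTN search reduces)."""
    # Squared Euclidean distances between every descriptor and codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bow_histogram(codes, codebook_size):
    """Accumulate hard assignments into a normalized BoW histogram."""
    hist = np.bincount(codes, minlength=codebook_size).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy example: four 2-D descriptors, two codewords.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descs = np.array([[0.1, 0.2], [9.8, 10.1], [0.3, 0.1], [10.2, 9.9]])
codes = assign_codes(descs, codebook)
print(codes.tolist())           # [0, 1, 0, 1]
print(bow_histogram(codes, 2))  # [0.5 0.5]
```

The distance computation above is O(descriptors x codewords); NTN's observation is that spatially adjacent descriptors usually share a codeword, so much of this work can be skipped.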
|
Similar papers:
[rank all papers by similarity to this]
|
|
Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers [pdf]
Phillip Isola, Ce Liu |
|
Abstract: To quickly synthesize complex scenes, digital artists often collage together visual elements from multiple sources: for example, mountains from New Zealand behind a Scottish castle with wisps of Saharan sand in front. In this paper, we propose to use a similar process in order to parse a scene. We model a scene as a collage of warped, layered objects sampled from labeled, reference images. Each object is related to the rest by a set of support constraints. Scene parsing is achieved through analysis-by-synthesis. Starting with a dataset of labeled exemplar scenes, we retrieve a dictionary of candidate object segments that match a query image. We then combine elements of this set into a scene collage that explains the query image. Beyond just assigning object labels to pixels, scene collaging produces a lot more information such as the number of each type of object in the scene, how they support one another, the ordinal depth of each object, and, to some degree, occluded content. We exploit this representation for several applications: image editing, random scene synthesis, and image-to-anaglyph.
|
Similar papers:
[rank all papers by similarity to this]
|
|
What is the Most Efficient Way to Select Nearest Neighbor Candidates for Fast Approximate Nearest Neighbor Search? [pdf]
Masakazu Iwamura, Tomokazu Sato, Koichi Kise |
|
Abstract: Approximate nearest neighbor search (ANNS) is a basic and important technique used in many tasks such as object recognition. It involves two processes: selecting nearest neighbor candidates and performing a brute-force search of these candidates. Only the former, though, has scope for improvement. Most existing methods approximate the space by quantization. They then calculate all the distances between the query and all the quantized values (e.g., clusters or bit sequences), and select a fixed number of candidates close to the query. The performance of a method is evaluated based on accuracy as a function of the number of candidates. This evaluation seems rational but poses a serious problem; it ignores the computational cost of the selection process itself. In this paper, we propose a new ANNS method that takes into account the costs of the selection process. Whereas existing methods employ computationally expensive techniques such as comparison sort and heaps, the proposed method does not. This realizes a significantly more efficient search. We have succeeded in reducing computation times by one-third compared with the state-of-the-art in an experiment using 100 million SIFT features.
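For readers unfamiliar with the two-process structure the abstract describes, here is a minimal inverted-file-style sketch (a generic illustration, not the paper's method): candidates are selected from the clusters nearest the query, and only those candidates are searched brute-force. Names and toy data are hypothetical.

```python
import numpy as np

def build_index(data, centroids):
    """Offline step: bucket each database vector under its nearest centroid."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    owner = d2.argmin(axis=1)
    return {c: np.where(owner == c)[0] for c in range(len(centroids))}

def ann_search(query, data, centroids, index, nprobe=1):
    """Query step: candidate selection (nearest nprobe clusters), then a
    brute-force search restricted to those candidates."""
    dc = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(dc)[:nprobe]                  # selection process
    cand = np.concatenate([index[c] for c in probe])
    dq = ((data[cand] - query) ** 2).sum(axis=1)     # brute-force process
    return int(cand[dq.argmin()])

data = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
centroids = np.array([[0.5, 0.0], [10.5, 10.0]])
index = build_index(data, centroids)
print(ann_search(np.array([10.2, 10.0]), data, centroids, index))  # 2
```

The `np.argsort` call in the selection step is exactly the kind of comparison-sort cost the paper argues should be accounted for and avoided.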
|
Similar papers:
[rank all papers by similarity to this]
|
|
Real-World Normal Map Capture for Nearly Flat Reflective Surfaces [pdf]
Bastien Jacquet, Christian Häne, Kevin Köser, Marc Pollefeys |
|
Abstract: Although specular objects have gained interest in recent years, virtually no approaches exist for markerless reconstruction of reflective scenes in the wild. In this work, we present a practical approach to capturing normal maps in real-world scenes using video only. We focus on nearly planar surfaces such as windows, facades of glass or metal, or frames, screens and other indoor objects, and show how normal maps of these can be obtained without the use of an artificial calibration object. Rather, we track the reflections of real-world straight lines while moving with a hand-held or vehicle-mounted camera in front of the object. In contrast to error-prone local edge tracking, we obtain the reflections by a robust, global segmentation technique on an ortho-rectified 3D video cube that also naturally allows efficient user interaction. Then, at each point of the reflective surface, the resulting 2D-curve to 3D-line correspondence provides a novel quadratic constraint on the local surface normal. This allows the shape to be solved for globally via integrability and smoothness constraints, and easily supports the use of multiple lines. We demonstrate the technique on several objects and facades.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Classification cascades have been very effective for object detection. However, such a cascade fails to perform well in data domains with variations in appearance that may not be captured in the training examples. This limited generalization severely restricts the domains for which it can be used effectively. A common approach to addressing this limitation is to train a new cascade of classifiers from scratch for each of the new domains. Building separate detectors for each of the different domains requires huge annotation and computational effort, making this approach not scalable to a large number of data domains. Here we present an algorithm for quickly adapting a pre-trained cascade of classifiers using a small number of labeled positive instances from a different yet similar data domain. In our experiments with images of human babies and human-like characters from movies, we demonstrate that the adapted cascade significantly outperforms both the original cascade and one trained from scratch using the given training examples.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Coarse-to-Fine Semantic Video Segmentation Using Supervoxel Trees [pdf]
Aastha Jain, Shuanak Chatterjee, Rene Vidal |
|
Abstract: We propose an exact, general and efficient coarse-to-fine energy minimization strategy for semantic video segmentation. Our strategy is based on a hierarchical abstraction of the supervoxel graph that allows us to minimize an energy defined at the finest level of the hierarchy by minimizing a series of simpler energies defined over coarser graphs. The strategy is exact, i.e., it produces the same solution as minimizing over the finest graph. It is general, i.e., it can be used to minimize any energy function (e.g., unary, pairwise, and higher-order terms) with any existing energy minimization algorithm (e.g., graph cuts and belief propagation). It also gives significant speedups in inference for several datasets with varying degrees of spatio-temporal continuity. We also discuss the strengths and weaknesses of our strategy relative to existing hierarchical approaches, and the kinds of image and video data that provide the best speedups.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: The higher-order clustering problem arises when data is drawn from multiple subspaces or when observations fit a higher-order parametric model. Most solutions to this problem either decompose higher-order similarity measures for use in spectral clustering or explicitly use low-rank matrix representations. In this paper we present our approach of Sparse Grassmann Clustering (SGC) that combines attributes of both categories. While we decompose the higher-order similarity tensor, we cluster data by directly finding a low-dimensional representation without explicitly building a similarity matrix. By exploiting recent advances in online estimation on the Grassmann manifold (GROUSE), we develop an efficient and accurate algorithm that works with individual columns of similarities or partial observations thereof. Since it avoids the storage and decomposition of large similarity matrices, our method is efficient, scalable and has low memory requirements even for large-scale data. We demonstrate the performance of our SGC method on a variety of segmentation problems including planar segmentation of Kinect depth maps and motion segmentation of the Hopkins 155 dataset, for which we achieve performance comparable to the state-of-the-art.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation [pdf]
Suyog Dutt Jain, Kristen Grauman |
|
Abstract: The mode of manual annotation used in an interactive segmentation algorithm affects both its accuracy and ease-of-use. For example, bounding boxes are fast to supply, yet may be too coarse to get good results on difficult images; freehand outlines are slower to supply and more specific, yet they may be overkill for simple images. Whereas existing methods assume a fixed form of input no matter the image, we propose to predict the tradeoff between accuracy and effort. Our approach learns whether a graph cuts segmentation will succeed if initialized with a given annotation mode, based on the image's visual separability and foreground uncertainty. Using these predictions, we optimize the mode of input requested on new images a user wants segmented. Whether given a single image that should be segmented as quickly as possible, or a batch of images that must be segmented within a specified time budget, we show how to select the easiest modality that will be sufficiently strong to yield high-quality segmentations. Extensive results with real users and three datasets demonstrate the impact.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Framework for Shape Analysis via Hilbert Space Embedding [pdf]
Sadeep Jayasumana, Mathieu Salzmann, Hongdong Li, Mehrtash Harandi |
|
Abstract: We propose a framework for 2D shape analysis using positive definite kernels defined on Kendall's shape manifold. Different representations of 2D shapes are known to generate different nonlinear spaces. Due to the nonlinearity of these spaces, most existing shape classification algorithms resort to nearest neighbor methods and to learning distances on shape spaces. Here, we propose to map shapes on Kendall's shape manifold to a high-dimensional Hilbert space where Euclidean geometry applies. To this end, we introduce a kernel on this manifold that permits such a mapping, and prove its positive definiteness. This kernel lets us extend kernel-based algorithms developed for Euclidean spaces, such as SVM, MKL and kernel PCA, to the shape manifold. We demonstrate the benefits of our approach over state-of-the-art methods on shape classification, clustering and retrieval.
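The mechanism this abstract relies on is general: once a positive definite kernel is available, any Euclidean kernel machine applies unchanged. A plain numpy sketch of kernel PCA follows, with an RBF kernel standing in for the paper's shape-manifold kernel (which is not reproduced here); all names and data are illustrative.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """A positive definite Gaussian kernel; a hypothetical stand-in for
    the shape-manifold kernel proposed in the paper."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_pca(K, n_components=2):
    """Project data onto the top principal directions of the Hilbert
    space implicitly defined by the kernel matrix K."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one  # double-center the kernel
    vals, vecs = np.linalg.eigh(Kc)             # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Two well-separated toy clusters: the first kernel-PCA component
# separates them, exactly as it would with a valid shape kernel.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z = kernel_pca(rbf_kernel(X, sigma=1.0), n_components=1)
```

Swapping `rbf_kernel` for any kernel proven positive definite on the manifold is the whole point: the downstream algorithm never changes.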
|
Similar papers:
[rank all papers by similarity to this]
|
|
Fluttering Pattern Generation Using Modified Legendre Sequence for Coded Exposure Imaging [pdf]
Hae-Gon Jeon, Joon-Young Lee, Yudeog Han, Seon Joo Kim, In So Kweon |
|
Abstract: Finding a good binary sequence is critical in determining the performance of coded exposure imaging, but previous methods mostly rely on a random search for finding the binary codes, which could easily fail to find good long sequences due to the exponentially growing search space. In this paper, we present a new computationally efficient algorithm for generating the binary sequence, which is especially well suited for longer sequences. We show that the concept of the low autocorrelation binary sequence that has been well exploited in the information theory community can be applied to generating the fluttering patterns of the shutter, propose a new measure of a good binary sequence, and present a new algorithm that modifies the Legendre sequence for coded exposure imaging. Experiments using both synthetic and real data show that our new algorithm consistently generates better binary sequences for the coded exposure problem, yielding better deblurring and resolution enhancement results compared to previous methods for generating binary codes.
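The classical (unmodified) Legendre sequence the abstract starts from has a direct construction via Euler's criterion: for an odd prime p, bit i is 1 iff i is a quadratic residue mod p. The sketch below generates that classical sequence only; the paper's modification is not reproduced.

```python
def legendre_sequence(p):
    """Classical Legendre sequence of prime length p: bit i is 1 iff i is
    a quadratic residue mod p (tested by Euler's criterion,
    i^((p-1)/2) mod p == 1). The index-0 bit is set to 1 by convention."""
    assert p > 2 and all(p % k for k in range(2, int(p ** 0.5) + 1)), \
        "p must be an odd prime"
    seq = [1]
    for i in range(1, p):
        seq.append(1 if pow(i, (p - 1) // 2, p) == 1 else 0)
    return seq

# Quadratic residues mod 11 are {1, 3, 4, 5, 9}:
print(legendre_sequence(11))  # [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
```

Such sequences are known to have low aperiodic autocorrelation, which is the property the abstract connects to well-conditioned (invertible) motion-blur kernels in coded exposure.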
|
Similar papers:
[rank all papers by similarity to this]
|
|
Towards Understanding Action Recognition [pdf]
Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, Michael J. Black |
|
Abstract: Although action recognition in videos is widely studied, current methods often fail on real-world datasets. Many recent approaches improve accuracy and robustness to cope with challenging video sequences, but it is often unclear what affects the results most. This paper attempts to provide insights based on a systematic performance evaluation using thoroughly-annotated data of human actions. We annotate human Joints for the HMDB dataset (J-HMDB). This annotation can be used to derive ground truth optical flow and segmentation. We evaluate current methods using this dataset and systematically replace the output of various algorithms with ground truth. This enables us to discover what is important: for example, should we work on improving flow algorithms, estimating human bounding boxes, or enabling pose estimation? In summary, we find that high-level pose features greatly outperform low/mid-level features; in particular, pose over time is critical, but current pose estimation algorithms are not yet reliable enough to provide this information. We also find that the accuracy of a top-performing action recognition framework can be greatly increased by refining the underlying low/mid-level features; this suggests it is important to improve optical flow and human detection algorithms. Our analysis and the J-HMDB dataset should facilitate a deeper understanding of action recognition algorithms.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: It is known that purely low-level saliency cues, such as frequency, do not lead to good salient object detection results, and that high-level knowledge must be adopted for successful discovery of task-independent salient objects. In this paper, we propose an efficient way to combine such high-level saliency priors and low-level appearance models. We obtain the high-level saliency prior with the objectness algorithm to find potential object candidates without the need for category information, and then enforce consistency among the salient regions using a Gaussian MRF with weights scaled by diverse density, which emphasizes the influence of potential foreground pixels. Our model obtains saliency maps that assign high scores to the whole salient object, and achieves state-of-the-art performance on benchmark datasets covering various foreground statistics.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Recent years have witnessed the success of large-scale image classification systems that are able to identify objects among thousands of possible labels. However, it is yet unclear how general classifiers, such as ones trained on ImageNet, can be optimally adapted to specific tasks, each of which only covers a semantically related subset of all the objects in the world. It is inefficient and suboptimal to retrain classifiers whenever a new task is given, and retraining is inapplicable when tasks are not given explicitly, but implicitly specified as a set of image queries. In this paper we propose a novel probabilistic model that jointly identifies the underlying task and performs prediction with a linear-time probabilistic inference algorithm, given a set of query images from a latent task. We present efficient ways to estimate parameters for the model, and an open-source toolbox to train classifiers distributedly at a large scale. Empirical results based on the ImageNet data showed significant performance increases over several baseline algorithms.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We present a linear method for global camera pose registration from pairwise relative poses encoded in essential matrices. Our method minimizes an approximate geometric error to enforce the triangular relationship in camera triplets. This formulation does not suffer from the typical unbalanced scale problem in linear methods relying on pairwise translation direction constraints (i.e., an algebraic error), nor from the system degeneracy caused by collinear motion. In the case of three cameras, our method provides a good linear approximation of the trifocal tensor. It can be directly scaled up to register multiple cameras. The results obtained are accurate for point triangulation and can serve as a good initialization for final bundle adjustment. We evaluate the algorithm's performance with different types of data and demonstrate its effectiveness. Our system produces good accuracy and robustness, and outperforms some well-known systems in efficiency.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Saliency Detection via Absorbing Markov Chain [pdf]
Bowen Jiang, Lihe Zhang, Huchuan Lu, Chuan Yang, Ming-Hsuan Yang |
|
Abstract: In this paper, we formulate saliency detection via an absorbing Markov chain on an image graph model. We jointly consider the appearance divergence and spatial distribution of salient objects and the background. Virtual boundary nodes are chosen as the absorbing nodes in the Markov chain, and the absorbed time from each transient node to the boundary absorbing nodes is computed. The absorbed time of a transient node measures its global similarity with all absorbing nodes, and thus salient objects can be consistently separated from the background when the absorbed time is used as a metric. Since the time from a transient node to the absorbing nodes depends on the weights along the path and their spatial distance, a background region in the center of the image may incorrectly appear salient. We further exploit the equilibrium distribution in an ergodic Markov chain to reduce the absorbed time in long-range smooth background regions. Extensive experiments on four benchmark datasets demonstrate the robustness and efficiency of the proposed method against state-of-the-art methods.
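The absorbed time at the heart of this abstract has a standard closed form: with Q the transient-to-transient block of the chain's transition matrix, the expected steps to absorption are t = (I - Q)^{-1} 1 (the fundamental matrix applied to a ones vector). A minimal sketch on a toy chain follows; the paper's image-graph construction and equilibrium-distribution refinement are not reproduced.

```python
import numpy as np

def absorbed_time(Q):
    """Expected number of steps until absorption for each transient node,
    given the transient-to-transient transition block Q."""
    n = Q.shape[0]
    fundamental = np.linalg.inv(np.eye(n) - Q)  # N = (I - Q)^{-1}
    return fundamental @ np.ones(n)

# Toy chain: two transient nodes that each move to the other with
# probability 0.5 and are absorbed otherwise.
Q = np.array([[0.0, 0.5],
              [0.5, 0.0]])
print(absorbed_time(Q))  # [2. 2.]
```

In the paper's setting the transient nodes are image regions and the absorbing nodes sit on the image boundary, so a large absorbed time marks a region as salient.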
|
Similar papers:
[rank all papers by similarity to this]
|
|
Salient Region Detection by UFO: Uniqueness, Focusness and Objectness [pdf]
Peng Jiang, Haibin Ling, Jingyi Yu, Jingliang Peng |
|
Abstract: The goal of saliency detection is to locate important pixels or regions in an image which attract humans' visual attention the most. This is a fundamental task whose output may serve as the basis for further computer vision tasks like segmentation, resizing, tracking and so forth.
In this paper we propose a novel salient region detection algorithm by integrating three important visual cues, namely uniqueness, focusness and objectness (UFO). In particular, uniqueness captures the appearance-derived visual contrast; focusness reflects the fact that salient regions are often photographed in focus; and objectness helps keep the completeness of detected salient regions. While uniqueness has long been used for saliency detection, it is new to integrate focusness and objectness for this purpose. In fact, focusness and objectness both provide important saliency information complementary to uniqueness. In our experiments using public benchmark datasets, we show that, even with a simple pixel-level combination of the three components, the proposed approach yields significant improvement compared with previously reported methods.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Complementary Projection Hashing [pdf]
Zhongming Jin, Yao Hu, Yue Lin, Debing Zhang, Shiding Lin, Deng Cai, Xuelong Li |
|
Abstract:
|
Similar papers:
[rank all papers by similarity to this]
|
|
Human Attribute Recognition by Rich Appearance Dictionary [pdf]
Jungseock Joo, Shuo Wang, Song-Chun Zhu |
|
Abstract: We present a part-based approach to the problem of human attribute recognition from a single image of a human body. To recognize the attributes of a human from the body parts, it is important to reliably detect the parts. This is a challenging task due to geometric variation such as articulation and view-point changes, as well as the appearance variation of the parts arising from versatile clothing types. Prior works have primarily focused on handling geometric variation by relying on pre-trained part detectors or pose estimators, which require manual part annotation, but the appearance variation has been relatively neglected in these works. This paper explores the importance of the appearance variation, which is directly related to the main task, attribute recognition. To this end, we propose to learn a rich appearance part dictionary of the human body with significantly less supervision by decomposing the image lattice into overlapping windows at multiple scales and iteratively refining local appearance templates. We also present quantitative results in which our proposed method outperforms the existing approaches.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: In underwater environments, cameras need to be confined in an underwater housing, viewing the scene through a piece of glass. In the case of flat-port underwater housings, light rays entering the camera housing are refracted twice, due to the different medium densities of water, glass, and air. This causes the usually linear rays of light to bend, and the commonly used pinhole camera model to become invalid. When using the pinhole camera model without explicitly modeling refraction in Structure-from-Motion (SfM) methods, a systematic model error occurs. Therefore, in this paper, we propose a system for computing the camera path and 3D points with explicit incorporation of refraction, using new methods for pose estimation. Additionally, a new error function is introduced for non-linear optimization, especially bundle adjustment. The proposed method allows reconstruction accuracy to be increased and is evaluated in a set of experiments, where the proposed method's performance is compared to SfM with the perspective camera model.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We address the problem of 3D scene labeling in a structured learning framework. Unlike previous work which uses structured Support Vector Machines, we employ the recently described Decision Tree Field and Regression Tree Field frameworks, which learn the unary and binary terms of a Conditional Random Field from training data. We show this has significant advantages in terms of inference speed, while maintaining similar accuracy. We also demonstrate empirically the importance for overall labeling accuracy of features that make use of prior knowledge about the coarse scene layout, such as the location of the ground plane. We show how this coarse layout can be estimated by our framework automatically, and that this information can be used to bootstrap improved accuracy in the detailed labeling.
|
Similar papers:
[rank all papers by similarity to this]
|
|
From Where and How to What We See [pdf]
S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath |
|
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data, without utilizing any image information. The proposed algorithm spatially clusters the eye tracking data obtained on an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and the Oxford-IIIT Pet Dataset, collected from 15 subjects.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Drosophila Embryo Stage Annotation Using Label Propagation [pdf]
Tomas Kazmar, Evgeny Z. Kvon, Alexander Stark, Christoph H. Lampert |
|
Abstract: In this work we propose a system for automatic classification of Drosophila embryos into developmental stages. While the system is designed to solve an actual problem in biological research, we believe that the principle underlying it is interesting not only for biologists, but also for researchers in computer vision.
The main idea is to combine two orthogonal sources of information: one is a classifier trained on strongly invariant features, which makes it applicable to images of very different conditions, but also leads to rather noisy predictions. The other is a label propagation step based on a more powerful similarity measure that, however, is only consistent within specific subsets of the data at a time.
In our biological setup, the information sources are the shape and the staining patterns of embryo images. We show experimentally that while neither of the methods can be used by itself to achieve satisfactory results, their combination achieves prediction quality comparable to human performance.
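As a rough illustration of the generic label propagation step mentioned above (the standard graph-smoothing scheme, not the authors' specific pipeline), the sketch below spreads noisy classifier scores over a similarity graph; the graph, seeds, and parameter values are all illustrative.

```python
import numpy as np

def propagate_labels(W, Y, alpha=0.8, iters=200):
    """Iterative label propagation: smooth the initial score matrix Y over
    the similarity graph W (symmetrically normalized), with a pull of
    strength (1 - alpha) back toward the original scores."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))       # D^{-1/2} W D^{-1/2}
    F = Y.astype(float)
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F.argmax(axis=1)

# Two tight pairs of nodes joined by weak cross links; one seed per pair.
W = np.array([[0.00, 1.00, 0.01, 0.01],
              [1.00, 0.00, 0.01, 0.01],
              [0.01, 0.01, 0.00, 1.00],
              [0.01, 0.01, 1.00, 0.00]])
Y = np.array([[1.0, 0.0],   # node 0 seeded as class 0
              [0.0, 0.0],
              [0.0, 1.0],   # node 2 seeded as class 1
              [0.0, 0.0]])
print(propagate_labels(W, Y).tolist())  # [0, 0, 1, 1]
```

The key property the abstract exploits is visible here: the propagated labels are reliable exactly where the similarity measure is consistent (within each tight pair), which is why it is paired with a globally applicable classifier.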
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: In this paper we present a new concept of building a morphable model directly from photos on the Internet. Morphable models showed very impressive results more than a decade ago, and could potentially have a huge impact on all aspects of face modeling and recognition. One of the challenges, however, is to capture and register 3D laser scans of a large number of people and facial expressions. Nowadays, there are enormous amounts of face photos on the Internet, a large portion of which have semantic labels. We propose a framework to build a morphable model directly from photos; the framework includes dense registration of Internet photos, as well as new single-view shape reconstruction and modification algorithms.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Modifying the Memorability of Face Photographs [pdf]
Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba, Aude Oliva |
|
Abstract: Contemporary life bombards us with many new images of faces every day, which poses non-trivial constraints on human memory. The vast majority of face photographs are intended to be remembered, either because of personal relevance, commercial interests or because the pictures were deliberately designed to be memorable. Can we make a portrait more memorable or more forgettable automatically? Here, we provide a method to modify the memorability of individual face photographs, while keeping the identity and other facial traits (e.g. age, attractiveness, and emotional magnitude) of the individual fixed. We show that face photographs manipulated to be more memorable (or more forgettable) are indeed more often remembered (or forgotten) in a crowd-sourcing experiment with an accuracy of 74%. Quantifying and modifying the memorability of a face lends itself to many useful applications in computer vision and graphics, such as mnemonic aids for learning, photo editing applications for social networks and tools for designing memorable advertisements.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Joint Intensity and Depth Co-sparse Analysis Model for Depth Map Super-resolution [pdf]
Martin Kiechle, Simon Hawe, Martin Kleinsteuber |
|
Abstract: High-resolution depth maps can be inferred from low-resolution depth measurements and an additional high-resolution intensity image of the same scene. To that end, we introduce a bimodal co-sparse analysis model, which is able to capture the interdependency of registered intensity and depth information. This model is based on the assumption that the co-supports of corresponding bimodal image structures are aligned when computed by a suitable pair of analysis operators. No analytic form of such operators exists, and we propose a method for learning them from a set of registered training signals. This learning process is done offline and returns a bimodal analysis operator that is universally applicable to natural scenes. We use this to exploit the bimodal co-sparse analysis model as a prior for solving inverse problems, which leads to an efficient algorithm for depth map super-resolution.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements, and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we call Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. Such a model allows us to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel, even in the presence of partial occlusions, using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Version 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Curvature-Aware Regularization on Riemannian Submanifolds [pdf]
Kwang In Kim, James Tompkin, Christian Theobalt |
|
Abstract: One fundamental assumption in object recognition as well as in other computer vision and pattern recognition problems is that the data generation process lies on a manifold and that it respects the intrinsic geometry of the manifold. This assumption is held in several successful algorithms for diffusion and regularization, in particular, in graph-Laplacian-based algorithms. We claim that the performance of existing algorithms can be improved if we additionally account for how the manifold is embedded within the ambient space, i.e., if we consider the extrinsic geometry of the manifold. We present a procedure for characterizing the extrinsic (as well as intrinsic) curvature of a manifold M which is described by a sampled point cloud in a high-dimensional Euclidean space. Once estimated, we use this characterization in general diffusion and regularization on M, and form a new regularizer on a point cloud. The resulting re-weighted graph Laplacian demonstrates superior performance over the classical graph Laplacian in semi-supervised learning and spectral clustering.
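For readers unfamiliar with the classical graph Laplacian this abstract builds on, here is a minimal sketch of constructing it from a sampled point cloud and using it as a smoothness penalty. The curvature-aware re-weighting proposed in the paper is not reproduced; everything below (names, k-NN construction, Gaussian weights) is the standard baseline, stated as an assumption.

```python
import numpy as np

def knn_graph_laplacian(X, k=3, sigma=1.0):
    """Classical graph Laplacian L = D - W of a Gaussian-weighted
    k-NN graph over a sampled point cloud X of shape (n, d)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]            # skip self
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                           # symmetrize
    return np.diag(W.sum(1)) - W

def dirichlet_energy(L, f):
    """Smoothness penalty f^T L f used in Laplacian regularization:
    small for functions that vary slowly over the graph."""
    return float(f @ L @ f)
```

Constant functions cost nothing under this penalty, while rapidly oscillating ones are penalized, which is what makes it usable as a regularizer in semi-supervised learning.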
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Most conventional single image deblurring methods assume that the underlying scene is static and the blur is caused by only camera shake. In this paper, in contrast to this restrictive assumption, we address the deblurring problem of general dynamic scenes which contain multiple moving objects as well as camera shake. In the case of dynamic scenes, moving objects and background have different blur motions, so segmentation of the motion blur is required for deblurring each distinct blur motion accurately. Thus, we propose a novel energy model designed with the weighted sum of multiple blur data models, which estimates different motion blurs, their associated pixel-wise weights, and the resulting sharp image. In this framework, the local weights are determined adaptively and get high values when the corresponding data models have high data fidelity. The weight information is then used for the segmentation of the motion blur. Non-local regularization of the weights is also incorporated to produce more reliable segmentation results. A convex optimization-based method is used for the solution of the proposed energy model. Experimental results demonstrate that our method outperforms conventional approaches in deblurring both dynamic scenes and static scenes.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Fingerspelling Recognition with Semi-Markov Conditional Random Fields [pdf]
Taehwan Kim, Greg Shakhnarovich, Karen Livescu |
|
Abstract: Recognition of gesture sequences is in general a very difficult problem, but in certain domains the difficulty may be mitigated by exploiting the domain's grammar. One such grammatically constrained gesture sequence domain is sign language. In this paper we investigate the case of fingerspelling recognition, which can be very challenging due to the quick, small motions of the fingers. Most prior work on this task has assumed a closed vocabulary of fingerspelled words; here we study the more natural open-vocabulary case, where the only domain knowledge is the possible fingerspelled letters and statistics of their sequences. We develop a semi-Markov conditional model approach, where feature functions are defined over segments of video and their corresponding letter labels. We use classifiers of letters and linguistic handshape features, along with expected motion profiles, to define segmental feature functions. This approach improves letter error rate (Levenshtein distance between hypothesized and correct letter sequences) from 16.3% using a hidden Markov model baseline to 11.6% using the proposed semi-Markov model.
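The letter error rate quoted at the end is the Levenshtein (edit) distance between hypothesized and correct letter sequences, normalized by the reference length. A straightforward dynamic-programming implementation of that metric (the normalization convention is an assumption on my part):

```python
def levenshtein(ref, hyp):
    """Edit distance between two letter sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution or match
    return D[m][n]

def letter_error_rate(ref, hyp):
    """Edit distance normalized by reference length (assumed convention)."""
    return levenshtein(ref, hyp) / len(ref)
```

For example, `letter_error_rate("abcd", "abed")` is 0.25: one substitution over four reference letters.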
|
Similar papers:
[rank all papers by similarity to this]
|
|
Multi-view 3D Reconstruction from Uncalibrated Radially-Symmetric Cameras [pdf]
Jae-Hak Kim, Yuchao Dai, Hongdong Li, Xin Du, Jonghyuk Kim |
|
Abstract: We present a new multi-view 3D Euclidean reconstruction method for arbitrary uncalibrated radially-symmetric cameras, which needs no calibration or any camera model parameters other than radial symmetry. It is built on the radial 1D camera model [25], a unified mathematical abstraction of different types of radially-symmetric cameras. We formulate the problem of multi-view reconstruction for radial 1D cameras as a matrix rank minimization problem. An efficient implementation based on alternating direction continuation is proposed to handle the scalability issue for real-world applications. Our method applies to a wide range of omnidirectional cameras including both dioptric and catadioptric (central and non-central) cameras. Additionally, our method deals elegantly with complete and incomplete measurements under a unified framework. Experiments on both synthetic and real images from various types of cameras validate the superior performance of our new method, in terms of numerical accuracy and robustness.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Optical Flow via Locally Adaptive Fusion of Complementary Data Costs [pdf]
Tae Hyun Kim, Hee Seok Lee, Kyoung Mu Lee |
|
Abstract: Many state-of-the-art optical flow estimation algorithms optimize the data and regularization terms to solve ill-posed problems. In this paper, in contrast to the conventional optical flow framework that uses a single or fixed data model, we study a novel framework that employs a locally varying data term that adaptively combines multiple different types of data models. The locally adaptive data term greatly reduces the matching ambiguity due to the complementary nature of the multiple data models. The optimal number of complementary data models is learnt by minimizing the redundancy among them under the minimum description length (MDL) constraint. From these chosen data models, a new optical flow estimation energy model is designed with the weighted sum of the multiple data models, and a highly effective and practical convex optimization-based solution that finds the optical flow as well as the weights is proposed. Comparative experimental results on the Middlebury optical flow benchmark show that the proposed method using the complementary data models outperforms the state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
|
First-Photon Imaging: Scene Depth and Reflectance Acquisition from One Detected Photon per Pixel [pdf]
Ahmed Kirmani, Dongeek Shin, Dheera Venkatraman, Franco N. C. Wong, Vivek K Goyal |
|
Abstract: Capturing depth and reflectance images using active illumination despite the detection of little light backscattered from the scene has wide-ranging applications in computer vision. Conventionally, even with single-photon detectors, a large number of detected photons is needed at each pixel location to mitigate Poisson noise. Here, using only the first detected photon at each pixel location, we capture both the 3D structure and reflectivity of the scene, demonstrating greater photon efficiency than previous work. Our computational imager combines physically accurate photon-counting statistics with exploitation of spatial correlations present in real-world scenes. We experimentally achieve millimeter-accurate, sub-pulse width depth resolution and 4-bit reflectivity contrast, simultaneously, using only the first photon detection per pixel, even in the presence of high background noise. Our technique enables rapid, low-power, and noise-tolerant active optical imaging.
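A hedged sketch of the photon-counting statistics this abstract alludes to: under a simplified low-flux, no-background model (my own toy assumptions, not the paper's full noise model), each illumination pulse detects a photon with probability proportional to the pixel's reflectivity, so the number of pulses until the first detection is geometrically distributed and the maximum-likelihood estimate of that probability is its reciprocal.

```python
import numpy as np

def ml_reflectivity(n_pulses, eta=1.0):
    """Maximum-likelihood reflectivity from the number of illumination
    pulses fired until the first photon detection at a pixel.

    Assumed model: each pulse triggers a detection with probability
    p = eta * a (reflectivity a, system efficiency eta), so the pulse
    index n of the first detection is geometric and the ML estimate of
    p is 1/n. Background counts and spatial priors are ignored here.
    """
    n_pulses = np.asarray(n_pulses, dtype=float)
    return 1.0 / (eta * n_pulses)
```

Brighter pixels fire fewer pulses before the first detection, so the estimate decreases monotonically with the pulse count.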
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We describe a structure-from-motion framework that handles generalized cameras, such as moving rolling-shutter cameras, and works at an unprecedented scale (billions of images covering millions of linear kilometers of roads) by exploiting a good relative pose prior along vehicle paths. We exhibit a planet-scale, appearance-augmented point cloud constructed with our framework and demonstrate its practical use in correcting the pose of a street-level image collection.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: This work makes use of a novel, recently proposed epipolar constraint for computing the relative pose between two calibrated images. By enforcing the coplanarity of epipolar plane normal vectors, it constrains the three degrees of freedom of the relative rotation between two camera views directly, independently of the translation.
The present paper shows how the approach can be extended to n points, and translated into an efficient eigenvalue minimization over the three rotational degrees of freedom. Each iteration in the non-linear optimization has constant execution time, independently of the number of features. Two global optimization approaches are proposed. The first one consists of an efficient Levenberg-Marquardt scheme with randomized initial value, which already leads to stable and accurate results. The second scheme consists of a globally optimal branch-and-bound algorithm based on a bound on the eigenvalue variation derived from symmetric eigenvalue-perturbation theory. Analysis of the cost function reveals insights into the nature of a specific relative pose problem, and outlines the complexity under different conditions. The algorithm shows state-of-the-art performance w.r.t. essential-matrix based solutions, and a frame-to-frame application to a video sequence immediately leads to an alternative, real-time visual odometry solution.
Note: All algorithms in this paper are made available in the OpenGV library. Please visit http://laurentkneip.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We present a method to identify and exploit structures that are shared across different object categories, by using sparse coding to learn a shared basis for the part and root templates of Deformable Part Models (DPMs).
Our first contribution consists in using Shift-Invariant Sparse Coding (SISC) to learn mid-level elements that can translate during coding. This results in systematically better approximations than those attained using standard sparse coding. To emphasize that the learned mid-level structures are shiftable, we call them shufflets.
Our second contribution consists in using the resulting score to construct probabilistic upper bounds to the exact template scores, instead of taking them at face value as is common in current works. We integrate shufflets in Dual-Tree Branch-and-Bound and cascade-DPMs and demonstrate that we can achieve a substantial acceleration, with practically no loss in performance.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A New Image Quality Metric for Image Auto-denoising [pdf]
Xiangfei Kong, Kuan Li, Qingxiong Yang, Liu Wenyin, Ming-Hsuan Yang |
|
Abstract: This paper proposes a new non-reference image quality metric that can be adopted by state-of-the-art image/video denoising algorithms for auto-denoising. The proposed metric is extremely simple and can be implemented in four lines of Matlab code. The basic assumption employed by the proposed metric is that the noise should be independent of the original image. A direct measurement of this dependence is, however, impractical due to the relatively low accuracy of existing denoising methods. The proposed metric thus aims at maximizing the structure similarity between the input noisy image and the estimated image noise around homogeneous regions, and the structure similarity between the input noisy image and the denoised image around highly-structured regions, and is computed as the linear correlation coefficient of the two corresponding structure similarity maps. Numerous experimental results demonstrate that the proposed metric not only outperforms the current state-of-the-art non-reference quality metric quantitatively and qualitatively, but also better maintains temporal coherence when used for video denoising.
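The final aggregation step named in this abstract, the linear correlation coefficient of the two structure-similarity maps, can be sketched as below. Computing the structure-similarity maps themselves is left abstract, and the function name is mine; any per-pixel structure-similarity implementation could feed this.

```python
import numpy as np

def quality_score(ssim_noise_map, ssim_denoised_map):
    """Pearson (linear) correlation coefficient between two per-pixel
    structure-similarity maps: one comparing the noisy image with the
    estimated noise, the other comparing the noisy image with the
    denoised image. Returns a value in [-1, 1]."""
    a = np.ravel(ssim_noise_map).astype(float)
    b = np.ravel(ssim_denoised_map).astype(float)
    a -= a.mean()
    b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The score is invariant to affine rescaling of either map, which is the usual reason for choosing a correlation coefficient as the aggregate.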
|
Similar papers:
[rank all papers by similarity to this]
|
|
Joint Learning of Discriminative Prototypes and Large Margin Nearest Neighbor Classifiers [pdf]
Martin Kostinger, Paul Wohlhart, Peter M. Roth, Horst Bischof |
|
Abstract: In this paper, we raise important issues concerning the evaluation complexity of existing Mahalanobis metric learning methods. The complexity scales linearly with the size of the dataset. This is especially cumbersome on large scale or for real-time applications with limited time budget. To alleviate this problem we propose to represent the dataset by a fixed number of discriminative prototypes. In particular, we introduce a new method that jointly chooses the positioning of prototypes and also optimizes the Mahalanobis distance metric with respect to these. We show that choosing the positioning of the prototypes and learning the metric in parallel leads to a drastically reduced evaluation effort while maintaining the discriminative essence of the original dataset. Moreover, for most problems our method performing k-nearest prototype (k-NP) classification on the condensed dataset leads to even better generalization compared to k-NN classification using all data. Results on a variety of challenging benchmarks demonstrate the power of our method. These include standard machine learning datasets as well as the challenging Public Figures Face Database. On the competitive machine learning benchmarks we are comparable to the state-of-the-art while being more efficient. On the face benchmark we clearly outperform the state-of-the-art in Mahalanobis metric learning with drastically reduced evaluation effort.
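The k-nearest-prototype (k-NP) classification evaluated in this abstract reduces, at test time, to comparing a query against a small prototype set under a learned Mahalanobis metric. A minimal sketch (prototype positions and the metric M are assumed already learned elsewhere; all names are illustrative):

```python
import numpy as np

def knp_classify(x, prototypes, proto_labels, M, k=1):
    """k-nearest-prototype classification: compare a query x only
    against p learned prototypes (p, d) under the squared Mahalanobis
    distance (x - c)^T M (x - c), M positive semi-definite, then take
    a majority vote over the k nearest prototypes."""
    diffs = prototypes - x                                   # (p, d)
    dists = np.einsum('pd,de,pe->p', diffs, M, diffs)        # squared Mahalanobis
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(proto_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

Note how the metric matters: a diagonal M that inflates one axis can change which prototype is nearest, and hence the decision, relative to the Euclidean case.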
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: Current methods learn monolithic attribute predictors, with the assumption that a single model is sufficient to reflect human understanding of a visual attribute. However, in reality, humans vary in how they perceive the association between a named property and image content. For example, two people may have slightly different internal models for what makes a shoe look formal, or they may disagree on which of two scenes looks more cluttered. Rather than discount these differences as noise, we propose to learn user-specific attribute models. We adapt a generic model trained with annotations from multiple users, tailoring it to satisfy user-specific labels. Furthermore, we propose novel techniques to infer user-specific labels based on transitivity and contradictions in the user's search history. We demonstrate that adapted attributes improve accuracy over both existing monolithic models as well as models that learn from scratch with user-specific data alone. In addition, we show how adapted attributes are useful to personalize image search, whether with binary or relative attributes.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Attribute Pivots for Guiding Relevance Feedback in Image Search [pdf]
Adriana Kovashka, Kristen Grauman |
|
Abstract: In interactive image search, a user iteratively refines his results by giving feedback on exemplar images. Active selection methods aim to elicit useful feedback, but traditional approaches suffer from expensive selection criteria and cannot predict informativeness reliably due to the imprecision of relevance feedback. To address these drawbacks, we propose to actively select pivot exemplars for which feedback in the form of a visual comparison will most reduce the system's uncertainty. For example, the system might ask, "Is your target image more or less crowded than this image?" Our approach relies on a series of binary search trees in relative attribute space, together with a selection function that predicts the information gain were the user to compare his envisioned target to the next node deeper in a given attribute's tree. It makes interactive search more efficient than existing strategies, both in terms of the system's selection time and the user's feedback effort.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Pose Estimation with Unknown Focal Length Using Points, Directions and Lines [pdf]
Yubin Kuang, Kalle Astrom |
|
Abstract: In this paper, we study the geometry problems of estimating camera pose with unknown focal length using combinations of geometric primitives. We consider points, lines and also rich features such as quivers, i.e. points with one or more directions. We formulate the problems as polynomial systems where the constraints for different primitives are handled in a unified way. We develop efficient polynomial solvers for each of the derived cases with different combinations of primitives. The availability of these solvers enables robust pose estimation with unknown focal length for wider classes of features. Such rich features allow for fewer feature correspondences and generate larger inlier sets with higher probability. We demonstrate in synthetic experiments that our solvers are fast and numerically stable. For real images, we show that our solvers can be used in RANSAC loops to provide good initial solutions.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Real-Time Solution to the Absolute Pose Problem with Unknown Radial Distortion and Focal Length [pdf]
Zuzana Kukelova, Martin Bujnak, Tomas Pajdla |
|
Abstract: The problem of determining the absolute position and orientation of a camera from a set of 2D-to-3D point correspondences is one of the most important problems in computer vision, with a broad range of applications. In this paper we present a new solution to the absolute pose problem for a camera with unknown radial distortion and unknown focal length from five 2D-to-3D point correspondences. Our new solver is numerically more stable, more accurate, and significantly faster than the existing state-of-the-art minimal four point absolute pose solvers for this problem. Moreover, our solver results in fewer solutions and can handle larger radial distortions. The new solver is straightforward and uses only simple concepts from linear algebra. Therefore it is simpler than the state-of-the-art Gröbner basis solvers. We compare our new solver with the existing state-of-the-art solvers and show its usefulness on synthetic and real datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Discriminative Label Propagation for Multi-object Tracking with Sporadic Appearance Features [pdf]
K.C. Amit Kumar, Christophe De_Vleeschouwer |
|
Abstract: Given a set of plausible detections, detected at each time instant independently, we investigate how to associate them across time. This is done by propagating labels on a set of graphs that capture how the spatio-temporal and the appearance cues promote the assignment of identical or distinct labels to a pair of nodes. The graph construction is driven by the locally linear embedding (LLE) of either the spatio-temporal or the appearance features associated to the detections. Interestingly, the neighborhood of a node in each appearance graph is defined to include all nodes for which the appearance feature is available (except the ones that coexist at the same time). This allows us to connect the nodes that share the same appearance even if they are temporally distant, which gives our framework the uncommon ability to exploit appearance features that are available only sporadically along the sequence of detections.
Once the graphs have been defined, the multi-object tracking is formulated as the problem of finding a label assignment that is consistent with the constraints captured by each of the graphs. This results in a difference-of-convex program that can be efficiently solved. Experiments are performed on a basketball dataset and several well-known pedestrian datasets in order to validate the effectiveness of the proposed solution.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Camera Alignment Using Trajectory Intersections in Unsynchronized Videos [pdf]
Thomas Kuo, Santhoshkumar Sunderrajan, B.S. Manjunath |
|
Abstract: This paper addresses the novel and challenging problem of aligning camera views that are unsynchronized due to low and/or variable frame rates, using object trajectories. Unlike existing trajectory-based alignment methods, our method does not require frame-to-frame synchronization. Instead, we propose using the intersections of corresponding object trajectories to match views. To find these intersections, we introduce a novel trajectory matching algorithm based on matching Spatio-Temporal Context Graphs (STCGs). These graphs represent the distances between trajectories in time and space within a view, and are matched to an STCG from another view to find the corresponding trajectories. To the best of our knowledge, this is one of the first attempts to align views that are unsynchronized with variable frame rates. The results on simulated and real-world datasets show trajectory intersections are a viable feature for camera alignment, and that the trajectory matching method performs well in real-world scenarios.
|
Similar papers:
[rank all papers by similarity to this]
|
|
From Subcategories to Visual Composites: A Multi-level Framework for Object Detection [pdf]
Tian Lan, Michalis Raptis, Leonid Sigal, Greg Mori |
|
Abstract: The appearance of an object changes profoundly with pose, camera view and interactions of the object with other objects in the scene. This makes it challenging to learn detectors based on an object-level label (e.g., car). We postulate that having a richer set of labelings (at different levels of granularity) for an object, including finer-grained subcategories, consistent in appearance and view, and higher-order composites (contextual groupings of objects consistent in their spatial layout and appearance), can significantly alleviate these problems. However, obtaining such a rich set of annotations, including annotation of an exponentially growing set of object groupings, is simply not feasible.
We propose a weakly-supervised framework for object detection where we discover subcategories and the composites automatically with only traditional object-level category labels as input. To this end, we first propose an exemplar-SVM-based clustering approach, with latent SVM refinement, that discovers a variable length set of discriminative subcategories for each object class. We then develop a structured model for object detection that captures interactions among object subcategories and automatically discovers semantically meaningful and discriminatively relevant visual composites. We show that this model produces state-of-the-art performance on the UIUC phrase object detection benchmark.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: This paper introduces a novel similarity learning framework. Working with inequality constraints involving quadruplets of images, our approach aims at efficiently modeling similarity from rich or complex semantic label relationships. From these quadruplet-wise constraints, we propose a similarity learning framework relying on a convex optimization scheme. We then study how our metric learning scheme can exploit specific class relationships, such as class ranking (relative attributes), and class taxonomy. We show that classification using the learned metrics gets improved performance over state-of-the-art methods on several datasets. We also evaluate our approach in a new application to learn similarities between webpage screenshots in a fully unsupervised way.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Detecting Curved Symmetric Parts Using a Deformable Disc Model [pdf]
Tom Sie Ho Lee, Sanja Fidler, Sven Dickinson |
|
Abstract: Symmetry is a powerful shape regularity that's been exploited by perceptual grouping researchers in both human and computer vision to recover part structure from an image without a priori knowledge of scene content. Drawing on the concept of a medial axis, defined as the locus of centers of maximal inscribed discs that sweep out a symmetric part, we model part recovery as the search for a sequence of deformable maximal inscribed disc hypotheses generated from a multiscale superpixel segmentation, a framework proposed by [13]. However, we learn affinities between adjacent superpixels in a space that's invariant to bending and tapering along the symmetry axis, enabling us to capture a wider class of symmetric parts. Moreover, we introduce a global cost that perceptually integrates the hypothesis space by combining a pairwise and a higher-level smoothing term, which we minimize globally using dynamic programming. The new framework is demonstrated on two datasets, and is shown to significantly outperform the baseline [13].
|
Similar papers:
[rank all papers by similarity to this]
|
|
Deterministic Fitting of Multiple Structures Using Iterative MaxFS with Inlier Scale Estimation [pdf]
Kwang Hee Lee, Sang Wook Lee |
|
Abstract: 2013 IEEE International Conference on Computer Vision
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: In contrast to the current motion segmentation paradigm that assumes independence between the motion subspaces, we approach the motion segmentation problem by seeking the parsimonious basis set that can represent the data. Our formulation explicitly looks for the overlap between subspaces in order to achieve a minimal basis representation. This parsimonious basis set is important for the performance of our model selection scheme because the sharing of basis results in savings of model complexity cost. We propose the use of an affinity propagation based method to determine the number of motions. The key lies in the incorporation of a global cost model into the factor graph, serving the role of model complexity. The introduction of this global cost model requires additional message updates in the factor graph. We derive an efficient update for the new messages associated with this global cost model. An important step in the use of affinity propagation is subspace hypothesis generation. We use the row-sparse convex proxy solution as an initialization strategy. We further encourage the selection of subspace hypotheses with shared basis by integrating a discount scheme that lowers the factor graph facility cost based on shared basis. We verified the model selection and classification performance of our proposed method on both the original Hopkins 155 dataset and the more balanced Hopkins 380 dataset.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time [pdf]
Yong Jae Lee, Alexei A. Efros, Martial Hebert |
|
Abstract: We present a weakly-supervised visual data mining approach that discovers connections between recurring mid-level visual elements in historic (temporal) and geographic (spatial) image collections, and attempts to capture the underlying visual style. In contrast to existing discovery methods that mine for patterns that remain visually consistent throughout the dataset, our goal is to discover visual elements whose appearance changes due to change in time or location; i.e., that exhibit consistent stylistic variations across the label space (date or geo-location). To discover these elements, we first identify groups of patches that are style-sensitive. We then incrementally build correspondences to find the same element across the entire dataset. Finally, we train style-aware regressors that model each element's range of stylistic differences. We apply our approach to date and geo-location prediction and show substantial improvement over several baselines that do not model visual style. We also demonstrate the method's effectiveness on the related task of fine-grained classification.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Non-parametric Bayesian Network Prior of Human Pose [pdf]
Andreas M. Lehrmann, Peter V. Gehler, Sebastian Nowozin |
|
Abstract: Having a sensible prior of human pose is a vital ingredient for many computer vision applications, including tracking and pose estimation. While the application of global non-parametric approaches and parametric models has led to some success, finding the right balance in terms of flexibility and tractability, as well as estimating model parameters from data, has turned out to be challenging. In this work, we introduce a sparse Bayesian network model of human pose that is non-parametric with respect to the estimation of both its graph structure and its local distributions. We describe an efficient sampling scheme for our model and show its tractability for the computation of exact log-likelihoods. We empirically validate our approach on the Human 3.6M dataset and demonstrate superior performance to global models and parametric networks. We further illustrate our model's ability to represent and compose poses not present in the training set (compositionality) and describe a speed-accuracy trade-off that allows real-time scoring of poses.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Total Variation Regularization for Functions with Values in a Manifold [pdf]
Jan Lellmann, Evgeny Strekalovskiy, Sabrina Koetter, Daniel Cremers |
|
Abstract: While total variation is among the most popular regularizers for variational problems, its extension to functions with values in a manifold is an open problem. In this paper, we propose the first algorithm to solve such problems which applies to arbitrary Riemannian manifolds. The key idea is to reformulate the variational problem as a multilabel optimization problem with an infinite number of labels. This leads to a hard optimization problem which can be approximately solved using convex relaxation techniques. The framework can be easily adapted to different manifolds including spheres and three-dimensional rotations, and allows obtaining accurate solutions even with a relatively coarse discretization. With numerous examples we demonstrate that the proposed framework can be applied to variational models that incorporate chromaticity values, normal fields, or camera trajectories.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Locally Affine Sparse-to-Dense Matching for Motion and Occlusion Estimation [pdf]
Marius Leordeanu, Andrei Zanfir, Cristian Sminchisescu |
|
Abstract: Estimating a dense correspondence field between successive video frames, under large displacement, is important in many visual learning and recognition tasks. We propose a novel sparse-to-dense matching method for motion field estimation and occlusion detection. As an alternative to the current coarse-to-fine approaches from the optical flow literature, we start from the higher level of sparse matching with rich appearance and geometric constraints collected over extended neighborhoods, using an occlusion-aware, locally affine model. Then, we move towards the simpler, but denser, classic flow field model, with an interpolation procedure that offers a natural transition between the sparse and the dense correspondence fields. We experimentally demonstrate that our appearance features and our complex geometric constraints permit correct motion estimation even in difficult cases of large displacements and significant appearance changes. We also propose a novel classification method for occlusion detection that works in conjunction with the sparse-to-dense matching model. We validate our approach on the newly released Sintel dataset and obtain state-of-the-art results.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Sequential Bayesian Model Update under Structured Scene Prior for Semantic Road Scenes Labeling [pdf]
Evgeny Levinkov, Mario Fritz |
|
Abstract: Semantic road labeling is a key component of systems that aim at assisted or even autonomous driving. Considering that such systems continuously operate in the real world, unforeseen conditions not represented in any conceivable training procedure are likely to occur on a regular basis. In order to equip systems with the ability to cope with such situations, we would like to enable adaptation to such new situations and conditions at runtime.
Existing adaptive methods for image labeling either require labeled data from the new condition or even operate globally on a complete test set. Neither is a desirable mode of operation for a system as described above, where new images arrive sequentially and conditions may vary.
We study the effect of changing test conditions on scene labeling methods based on a new diverse street scene dataset. We propose a novel approach that can operate in such conditions and is based on a sequential Bayesian model update in order to robustly integrate the arriving images into the adaptation procedure.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Novel Earth Mover's Distance Methodology for Image Matching with Gaussian Mixture Models [pdf]
Peihua Li, Qilong Wang, Lei Zhang |
|
Abstract: The similarity or distance measure between Gaussian mixture models (GMMs) plays a crucial role in content-based image matching. Though the Earth Mover's Distance (EMD) has shown its advantages in matching histogram features, its potential in matching GMMs remains unclear and not fully explored. To address this problem, we propose a novel EMD methodology for GMM matching. We first present a sparse representation based EMD called SR-EMD by exploiting the sparse property of the underlying problem. SR-EMD is more efficient and robust than the conventional EMD. Second, we present two novel ground distances between component Gaussians based on information geometry. The perspective from Riemannian geometry distinguishes the proposed ground distances from the classical entropy- or divergence-based ones. Furthermore, motivated by the success of distance metric learning for vector data, we make the first attempt to learn EMD distance metrics between GMMs by using a simple yet effective supervised pair-wise based method. It can adapt the distance metrics between GMMs to specific classification tasks. The proposed method is evaluated on both simulated data and benchmark real databases and achieves very promising performance.
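As a rough illustration of the transportation problem underlying EMD (not the paper's SR-EMD or its information-geometric ground distances), the flow between two mixtures' components can be solved as a small linear program. The weights, means, and plain Euclidean ground distance below are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def gmm_emd(w1, mu1, w2, mu2):
    """EMD between two GMMs given component weights w and means mu,
    with Euclidean distance between means as a toy ground distance."""
    m, n = len(w1), len(w2)
    # pairwise ground distances, flattened row-major to match flow vars f_ij
    D = np.linalg.norm(mu1[:, None, :] - mu2[None, :, :], axis=2).ravel()
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):                      # outgoing flow of component i sums to w1[i]
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                      # incoming flow of component j sums to w2[j]
        A_eq[m + j, j::n] = 1.0
    res = linprog(D, A_eq=A_eq, b_eq=np.concatenate([w1, w2]), bounds=(0, None))
    return res.fun

w = np.array([0.5, 0.5])
mu_a = np.array([[0.0, 0.0], [1.0, 0.0]])
mu_b = np.array([[0.0, 1.0], [1.0, 1.0]])
d = gmm_emd(w, mu_a, w, mu_b)   # each component shifts by distance 1
```

The paper's contribution replaces this dense solver with a sparse formulation and learned ground metrics; this sketch only shows the baseline EMD objective.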
|
Similar papers:
[rank all papers by similarity to this]
|
|
Codemaps - Segment, Classify and Search Objects Locally [pdf]
Zhenyang Li, Efstratios Gavves, Koen E.A. van_de_Sande, Cees G.M. Snoek, Arnold W.M. Smeulders |
|
Abstract: In this paper we aim for segmentation and classification of objects. We propose codemaps, a joint formulation of the classification score and the local neighborhood it belongs to in the image. We obtain the codemap by reordering the encoding, pooling and classification steps over lattice elements. Unlike existing linear decompositions, which emphasize only the efficiency benefits for localized search, we make three novel contributions. As a preliminary, we provide a theoretical generalization of the sufficient mathematical conditions under which image encodings and classification become locally decomposable. As a first novelty, we introduce l2 normalization for arbitrarily shaped image regions, which is fast enough for semantic segmentation using our Fisher codemaps. Second, using the same lattice across images, we propose kernel pooling, which embeds nonlinearities into codemaps for object classification by explicit or approximate feature mappings. Results demonstrate that l2-normalized Fisher codemaps improve the state-of-the-art in semantic segmentation for PASCAL VOC. For object classification, the addition of nonlinearities brings us on par with the state-of-the-art, but is 3x faster. Because of the codemaps' inherent efficiency, we can reach significant speed-ups for localized search as well. We exploit the efficiency gain for our third novelty: object segment retrieval using a single query image only.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Contextual Hypergraph Modeling for Salient Object Detection [pdf]
Xi Li, Yao Li, Chunhua Shen, Anthony Dick, Anton Van_Den_Hengel |
|
Abstract: Salient object detection aims to locate objects that capture human attention within images. Previous approaches often pose this as a problem of image contrast analysis. In this work, we model an image as a hypergraph that utilizes a set of hyperedges to capture the contextual properties of image pixels or regions. As a result, the problem of salient object detection becomes one of finding salient vertices and hyperedges in the hypergraph. The main advantage of hypergraph modeling is that it takes into account each pixel's (or region's) affinity with its neighborhood as well as its separation from the image background. Furthermore, we propose an alternative approach based on center-versus-surround contextual contrast analysis, which performs salient object detection by optimizing a cost-sensitive support vector machine (SVM) objective function. Experimental results on four challenging datasets demonstrate the effectiveness of the proposed approaches against the state-of-the-art approaches to salient object detection.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Dynamic Pooling for Complex Event Recognition [pdf]
Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos |
|
Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event-specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment tuples. Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: This paper introduces an automatic method for removing reflection interference when imaging a scene behind a glass surface. Our approach exploits the subtle changes in the reflection with respect to the background in a small set of images taken at slightly different view points. Key to this idea is the use of SIFT-flow to align the images such that a pixel-wise comparison can be made across the input set. Gradients with variation across the image set are assumed to belong to the reflected scene while constant gradients are assumed to belong to the desired background scene. By correctly labelling gradients as belonging to reflection or background, the background scene can be separated from the reflection interference. Unlike previous approaches that exploit motion, our approach does not make any assumptions regarding the geometry of the background or reflected scenes, nor does it require the reflection to be static. This makes our approach practical for use in casual imaging scenarios. Our approach is straightforward and produces good results compared with existing methods.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We present a model for gaze prediction in egocentric video by leveraging the implicit cues that exist in a camera wearer's behavior. Specifically, we compute the camera wearer's head motion and hand location from the video and combine them to estimate where the eyes look. We further model the dynamic behavior of the gaze, in particular fixations, as latent variables to improve the gaze prediction. Our gaze prediction results outperform the state-of-the-art algorithms by a large margin on publicly available egocentric vision datasets. In addition, we demonstrate that we get a significant performance boost in recognizing daily actions and segmenting foreground objects by plugging our gaze predictions into state-of-the-art methods.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Log-Euclidean Kernels for Sparse Representation and Dictionary Learning [pdf]
Peihua Li, Qilong Wang, Wangmeng Zuo, Lei Zhang |
|
Abstract: Symmetric positive definite (SPD) matrices have been widely used in image and vision problems. Recently there is growing interest in studying sparse representation (SR) of SPD matrices, motivated by the great success of SR for vector data. Though the space of SPD matrices is well known to form a Lie group that is a Riemannian manifold, existing work fails to take full advantage of its geometric structure. This paper attempts to tackle this problem by proposing a kernel based method for SR and dictionary learning (DL) of SPD matrices. We show that the space of SPD matrices, with the operations of logarithmic multiplication and scalar logarithmic multiplication defined in the Log-Euclidean framework, is a complete inner product space. We can thus develop a broad family of kernels that satisfies Mercer's condition. These kernels characterize the geodesic distance and can be computed efficiently. We also consider the geometric structure in the DL process by updating atom matrices in the Riemannian space instead of in the Euclidean space. The proposed method is evaluated on various vision problems and shows notable performance gains over the state-of-the-art.
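One member of the Log-Euclidean kernel family the abstract refers to is a Gaussian kernel on the matrix logarithms. The following is a minimal sketch of that single kernel evaluation (the bandwidth and example matrices are assumptions, and the paper's SR/DL machinery is not shown):

```python
import numpy as np
from scipy.linalg import logm

def log_e_kernel(X, Y, sigma=1.0):
    """Gaussian kernel on SPD matrices in the Log-Euclidean framework:
    k(X, Y) = exp(-||log(X) - log(Y)||_F^2 / (2 sigma^2))."""
    d = np.linalg.norm(logm(X) - logm(Y), 'fro')
    return float(np.exp(-d * d / (2.0 * sigma ** 2)))

A = np.array([[2.0, 0.5], [0.5, 1.0]])   # SPD, e.g. a region covariance
B = np.array([[1.5, 0.2], [0.2, 1.2]])
k_ab = log_e_kernel(A, B)
```

Because the metric lives in the (flat) log-domain, the kernel is positive definite and the geodesic distance reduces to a Frobenius norm of log-matrices, which is what makes it cheap to evaluate.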
|
Similar papers:
[rank all papers by similarity to this]
|
|
Model Recommendation with Virtual Probes for Egocentric Hand Detection [pdf]
Cheng Li, Kris M. Kitani |
|
Abstract: Egocentric cameras can be used to benefit such tasks as analyzing fine motor skills, recognizing gestures and learning about hand-object manipulation. To enable such technology, we believe that the hands must be detected at the pixel level to gain important information about the shape of the hands and fingers. We show that the problem of pixel-wise hand detection can be effectively solved by posing the problem as a model recommendation task. As such, the goal of a recommendation system is to recommend the n-best hand detectors based on the probe set, a small amount of labeled data from the test distribution. This requirement of a probe set is a serious limitation in many applications, such as egocentric hand detection, where the test distribution may be continually changing. To address this limitation, we propose the use of virtual probes, which can be automatically extracted from the test distribution. The key idea is that many features, such as the color distribution or relative performance between two detectors, can be used as a proxy for the probe set. In our experiments we show that the recommendation paradigm is well-equipped to handle complex changes in the appearance of the hands in first-person vision. In particular, we show how our system is able to generalize to new scenarios by testing our model across multiple users.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Modeling Occlusion by Discriminative AND-OR Structures [pdf]
Bo Li, Wenze Hu, Tianfu Wu, Song-Chun Zhu |
|
Abstract: Occlusion presents a challenge for detecting objects in real world applications. To address this issue, this paper models object occlusion with an AND-OR structure which (i) represents occlusion at the semantic part level, and (ii) captures the regularities of different occlusion configurations (i.e., the different combinations of object part visibilities). This paper focuses on car detection on streets. Since annotating part occlusion on real images is time-consuming and error-prone, we propose to learn the AND-OR structure automatically using synthetic images of CAD models placed at different relative positions. The model parameters are learned from real images under the latent structural SVM (LSSVM) framework. In inference, an efficient dynamic programming (DP) algorithm is utilized. In experiments, we test our method on both car detection and car view estimation. Experimental results show that (i) our CAD simulation strategy is capable of generating occlusion patterns for real scenarios, (ii) the proposed AND-OR structure model is effective for modeling occlusions, outperforming the deformable part-based model (DPM) [6, 10] in car detection on both our self-collected street parking dataset and the Pascal VOC 2007 car dataset [4], and (iii) the learned model is on par with the state-of-the-art methods on car view estimation tested on two public datasets.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: This paper demonstrates how the nonlocal principle benefits video matting via the KNN Laplacian, which comes with a straightforward implementation using motion-aware K nearest neighbors. In hindsight, the fundamental problem to solve in video matting is to produce spatio-temporally coherent clusters of moving foreground pixels. When used as described, the motion-aware KNN Laplacian is effective in addressing this fundamental problem, as demonstrated by sparse user markups, typically on only one frame, in a variety of challenging examples featuring ambiguous foreground and background colors, changing topologies with disocclusion, significant illumination changes, fast motion, and motion blur. When working with existing Laplacian-based systems, our Laplacian is expected to benefit them immediately with improved clustering of moving foreground pixels.
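The basic construction behind a KNN Laplacian can be sketched as below; this is a generic version on arbitrary feature vectors, whereas the paper's motion-aware variant builds the features from per-pixel color plus flow-compensated coordinates (the symmetrization and k value here are assumptions):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.spatial import cKDTree

def knn_laplacian(feats, k=5):
    """Build L = D - A, where A links each sample to its k nearest
    neighbours in feature space and is then symmetrised."""
    n = len(feats)
    _, idx = cKDTree(feats).query(feats, k=k + 1)   # +1 to skip the self-match
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()
    A = coo_matrix((np.ones(n * k), (rows, cols)), shape=(n, n))
    A = ((A + A.T) / 2.0).tocsr()                   # symmetric affinity
    D = np.asarray(A.sum(axis=1)).ravel()           # degree per sample
    return coo_matrix((D, (np.arange(n), np.arange(n))), shape=(n, n)) - A

feats = np.random.default_rng(0).normal(size=(30, 5))
L = knn_laplacian(feats)
```

The matte is then typically solved as a sparse linear system regularized by L, with user markups supplying the constraints.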
|
Similar papers:
[rank all papers by similarity to this]
|
|
Perspective Motion Segmentation via Collaborative Clustering [pdf]
Zhuwen Li, Jiaming Guo, Loong-Fah Cheong, Steven Zhiying Zhou |
|
Abstract: This paper addresses real-world challenges in the motion segmentation problem, including perspective effects, missing data, and an unknown number of motions. It first formulates the 3-D motion segmentation from two perspective views as a subspace clustering problem, utilizing the epipolar constraint of an image pair. It then combines the point correspondence information across multiple image frames via a collaborative clustering step, in which tight integration is achieved via a mixed norm optimization scheme. For model selection, we propose an over-segment and merge approach, where the merging step is based on the property of the l1-norm of the mutual sparse representation of two over-segmented groups. The resulting algorithm can deal with incomplete trajectories and perspective effects substantially better than state-of-the-art two-frame and multi-frame methods. Experiments on a 62-clip dataset show the significant superiority of the proposed idea in both segmentation accuracy and model selection.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation [pdf]
Haoxiang Li, Gang Hua, Zhe Lin, Jonathan Brandt, Jianchao Yang |
|
Abstract: We propose an unsupervised detector adaptation algorithm to adapt any offline trained face detector to a specific collection of images, and hence achieve better accuracy. The core of our detector adaptation algorithm is a probabilistic elastic part (PEP) model, which is offline trained with a set of face examples. It produces a statistically-aligned part based face representation, namely the PEP representation. To adapt a general face detector to a collection of images, we compute the PEP representations of the candidate detections from the general face detector, and then train a discriminative classifier with the top positives and negatives. Then we re-rank all the candidate detections with this classifier. This way, a face detector tailored to the statistics of the specific image collection is adapted from the original detector. We present extensive results on three datasets with two state-of-the-art face detectors. The significant improvement of detection accuracy over these state-of-the-art face detectors strongly demonstrates the efficacy of the proposed face detector adaptation algorithm.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Saliency Detection via Dense and Sparse Reconstruction [pdf]
Xiaohui Li, Huchuan Lu, Lihe Zhang, Xiang Ruan, Ming-Hsuan Yang |
|
Abstract: In this paper, we propose a visual saliency detection algorithm from the perspective of reconstruction errors. The image boundaries are first extracted via superpixels as likely cues for background templates, from which dense and sparse appearance models are constructed. For each image region, we first compute dense and sparse reconstruction errors. Second, the reconstruction errors are propagated based on the contexts obtained from K-means clustering. Third, pixel-level saliency is computed by an integration of multi-scale reconstruction errors and refined by an object-biased Gaussian model. We apply the Bayes formula to integrate saliency measures based on dense and sparse reconstruction errors. Experimental results show that the proposed algorithm performs favorably against seventeen state-of-the-art methods in terms of precision and recall. In addition, the proposed algorithm is demonstrated to be more effective in highlighting salient objects uniformly and robust to background noise.
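The dense half of this idea can be sketched in a few lines: fit a low-dimensional basis to the boundary (background) templates and flag regions that the basis reconstructs poorly. The features, PCA dimension, and synthetic data below are illustrative assumptions; the sparse counterpart would instead reconstruct each region via sparse coding over the same templates.

```python
import numpy as np

def dense_reconstruction_error(templates, regions, k=3):
    """PCA-based dense reconstruction error: regions that the
    background basis reconstructs poorly are candidate salient regions."""
    mean = templates.mean(axis=0)
    # top-k principal directions of the background templates
    U = np.linalg.svd(templates - mean, full_matrices=False)[2][:k]
    Xc = regions - mean
    recon = Xc @ U.T @ U
    return ((Xc - recon) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
bg = rng.normal(size=(50, 8))                    # boundary-region features
probe = np.vstack([bg[:5], bg.mean(0) + 5.0])    # last row: a clear outlier
errs = dense_reconstruction_error(bg, probe)
```

In the paper this per-region error is then propagated over K-means contexts and fused with the sparse error via the Bayes formula; the sketch stops at the raw error.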
|
Similar papers:
[rank all papers by similarity to this]
|
|
Video Segmentation by Tracking Many Figure-Ground Segments [pdf]
Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, James M. Rehg |
|
Abstract: We propose an unsupervised video segmentation approach based on simultaneously tracking multiple holistic figure-ground segments. Segment tracks are initialized from a pool of segment proposals generated by a figure-ground segmentation algorithm. Then, online non-local appearance models are trained incrementally for each track using a multi-output regularized least squares formulation. By using the same set of training examples for all segment tracks, a computational trick allows us to track hundreds of segment tracks efficiently, as well as perform optimal online updates in closed form. In addition, a new composite statistical inference approach is proposed for refining the obtained segment tracks, which breaks down the initial segment proposals and recombines them into better ones by utilizing high-order statistic estimates from the appearance model and enforcing temporal consistency. For evaluating the algorithm, a dataset, SegTrack v2, is collected with about 1,000 frames with pixel-level annotations. The proposed framework outperforms state-of-the-art approaches on the dataset, showing its efficiency and robustness to challenges in different video sequences.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Joint Segmentation and Pose Tracking of Human in Natural Videos [pdf]
Taegyu Lim, Seunghoon Hong, Bohyung Han, Joon Hee Han |
|
Abstract: We propose an online algorithm to extract a human by foreground/background segmentation and estimate the pose of the human from videos captured by moving cameras. We claim that a virtuous cycle can be created by appropriate interactions between the two modules to solve individual problems. This joint estimation problem is divided into two subproblems, foreground/background segmentation and pose tracking, which alternate iteratively for optimization; the segmentation step generates a foreground mask for human pose tracking, and the human pose tracking step provides a foreground response map for segmentation. The final solution is obtained when the iterative procedure converges. We evaluate our algorithm quantitatively and qualitatively on real videos involving various challenges, and present its outstanding performance compared to the state-of-the-art techniques for segmentation and pose estimation.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Abstract: We address the problem of localizing and estimating the fine pose of objects in an image with exact 3D models. Our main focus is to unify contributions from the 1970s with recent advances in object detection: use local keypoint detectors to find candidate poses and score the global alignment of each candidate pose to the image. Moreover, we also provide a new dataset containing fine-aligned objects with their exactly matched 3D models, and a set of models for widely used objects. We evaluate our algorithm on both object detection and fine pose estimation, and show that our method outperforms state-of-the-art algorithms.
|
Similar papers:
[rank all papers by similarity to this]
|
|
3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval [pdf]
Yen-Liang Lin, Cheng-Yu Huang, Hao-Jeng Wang, Winston Hsu |
|
Abstract: We propose a 3D sub-query expansion approach for boosting sketch-based multi-view image retrieval. The core idea of our method is to automatically convert two (guided) 2D sketches into an approximated 3D sketch model, and then generate multi-view sketches as expanded sub-queries to improve the retrieval performance. To learn the weights among synthesized views (sub-queries), we present a new multi-query feature to model the similarity between sub-queries and dataset images, and formulate it as a convex optimization problem. Our approach shows superior performance compared with the state-of-the-art approach on a public multi-view image dataset. Moreover, we also conduct sensitivity tests to analyze the parameters of our approach based on the gathered user sketches.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A General Two-Step Approach to Learning-Based Hashing [pdf]
Guosheng Lin, Chunhua Shen, David Suter, Anton van_den_Hengel |
|
Abstract: Most existing approaches to hashing apply a single form of hash function, and an optimization process which is typically deeply coupled to this specific form. This tight coupling restricts the flexibility of the method to respond to the data, and can result in complex optimization problems that are difficult to solve. Here we propose a flexible yet simple framework that is able to accommodate different types of loss functions and hash functions. This framework allows a number of existing approaches to hashing to be placed in context, and simplifies the development of new problem-specific hashing methods. Our framework decomposes the hashing learning problem into two steps: hash bit learning, and hash function learning based on the learned bits. The first step can typically be formulated as a binary quadratic problem, and the second step can be accomplished by training standard binary classifiers. Both problems have been extensively studied in the literature. Our extensive experiments demonstrate that the proposed framework is effective, flexible and outperforms the state-of-the-art.
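The two-step decomposition can be illustrated with simple stand-ins: thresholded PCA projections in place of the paper's binary-quadratic bit-learning step, and a least-squares linear classifier per bit as the learned hash function. Every modeling choice below is an assumption for illustration, not the paper's method:

```python
import numpy as np

def two_step_hashing(X, n_bits=4):
    """Step 1: obtain target binary codes (here: signs of PCA
    projections). Step 2: fit one standard per-bit classifier
    (least squares + sign) to reproduce those codes on new data."""
    mu = X.mean(axis=0)
    Xc = X - mu
    U = np.linalg.svd(Xc, full_matrices=False)[2][:n_bits]
    bits = np.sign(Xc @ U.T)                        # step 1: target codes in {-1,+1}
    W, *_ = np.linalg.lstsq(Xc, bits, rcond=None)   # step 2: per-bit hash functions
    return W, bits, mu

X = np.random.default_rng(1).normal(size=(60, 6))
W, bits, mu = two_step_hashing(X)
codes = np.sign((X - mu) @ W)                       # apply the learned hash functions
agreement = (codes == bits).mean()
```

The point of the decomposition is that step 2 is just supervised classification, so any off-the-shelf classifier (linear, kernel, boosted) can serve as the hash function without touching the bit-learning objective.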
|
Similar papers:
[rank all papers by similarity to this]
|
|
Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes [pdf]
Dahua Lin, Jianxiong Xiao |
|
Abstract: In this paper, we develop a generative model to describe the layouts of outdoor scenes, i.e., the spatial configuration of regions. Specifically, the layout of an image is represented as a composite of regions, each associated with a semantic topic. At the heart of this model is a novel stochastic process called the Spatial Topic Process, which generates a spatial map of topics from a set of coupled Gaussian processes, thus allowing the distributions of topics to vary continuously across the image plane. A key aspect that distinguishes this model from previous ones consists in its capability of capturing dependencies across both locations and topics while allowing substantial variations in the layouts. We demonstrate the practical utility of the proposed model by testing it on scene classification, semantic segmentation, and layout hallucination.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Holistic Scene Understanding for 3D Object Detection with RGBD Cameras [pdf]
Dahua Lin, Sanja Fidler, Raquel Urtasun |
|
Abstract: In this paper, we tackle the problem of indoor scene understanding using RGBD data. Towards this goal, we propose a holistic approach that exploits 2D segmentation, 3D geometry, as well as contextual relations between scenes and objects. Specifically, we extend the CPMC [3] framework to 3D in order to generate candidate cuboids, and develop a conditional random field to integrate information from different sources to classify the cuboids. With this formulation, scene classification and 3D object recognition are coupled and can be jointly solved through probabilistic inference. We test the effectiveness of our approach on the challenging NYU v2 dataset. The experimental results demonstrate that through effective evidence integration and holistic reasoning, our approach achieves substantial improvement over the state-of-the-art.
|
Similar papers:
[rank all papers by similarity to this]
|
|
Robust Non-parametric Data Fitting for Correspondence Modeling [pdf]
Wen-Yan Lin, Ming-Ming Cheng, Shuai Zheng, Jiangbo Lu, Nigel Crook |
|
Abstract: We propose a generic method for obtaining non-parametric image warps from noisy point correspondences. Our formulation integrates a Huber function into a motion coherence framework. This makes our fitting function especially robust to piecewise correspondence noise (where an image section is consistently mismatched). By utilizing over-parameterized curves, we can generate realistic non-parametric image warps from very noisy correspondences. We also demonstrate how our algorithm can be used to help stitch images taken from a panning camera by warping the images onto a virtual push-broom camera imaging plane.
|
Similar papers:
[rank all papers by similarity to this]
|
|
A Scalable Unsupervised Feature Merging Approach to Efficient Dimensionality Reduction of High-Dimensional Visual Data [pdf]
Lingqiao Liu, Lei Wang |
|
Abstract: To achieve a good trade-off between recognition accuracy and computational efficiency, it is often necessary to reduce high-dimensional visual data to medium-dimensional data. For this task, even applying a simple full-matrix-based linear projection causes significant computation and memory use. When the number of visual data items is large, even efficiently learning such a projection can become a problem. The recent feature merging approach offers an efficient way to reduce the dimensionality, requiring only a single scan of the features to perform the reduction. However, existing merging algorithms do not scale well with high-dimensional data, especially in the unsupervised case.
To address this problem, we formulate unsupervised feature merging as a PCA problem with a special structure constraint. By exploiting its connection with k-means, we transform this constrained PCA problem into a feature clustering problem. Moreover, we employ a hashing technique to improve its scalability. These produce a scalable feature merging algorithm for our dimensionality reduction task. In addition, we develop an extension of this method by leveraging the neighborhood structure in the data to further improve dimensionality reduction performance. Furthermore, we explore the incorporation of bipolar merging, a variant of the merging function which allows the subtraction operation, into our algorithms.
Through three applications in visual recognition, we demonstrate that our
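The "constrained PCA as feature clustering" view can be sketched directly: cluster the feature dimensions themselves, then merge by summing the dimensions that land in the same cluster. Plain k-means with a deterministic farthest-point initialisation stands in for the paper's hashing-accelerated algorithm; the function name and details are illustrative assumptions.

```python
import numpy as np

def merge_features(X, k, iters=20):
    """Toy unsupervised feature merging: cluster the D feature dimensions
    of X (N x D) with k-means on their columns, then reduce X by summing
    the dimensions that fall in the same cluster."""
    cols = X.T                                   # one point per feature dimension
    centers = [cols[0]]                          # deterministic farthest-point init
    for _ in range(k - 1):
        d = np.min([((cols - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(cols[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        d = ((cols[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = cols[assign == j].mean(0)
    M = np.zeros((X.shape[1], k))                # D x k 0/1 merging matrix
    M[np.arange(X.shape[1]), assign] = 1.0
    return X @ M, assign
```

Once `M` is fixed, the reduction is a single scan: each feature is added into exactly one output dimension.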
|
|
|
Automatic Kronecker Product Model Based Detection of Repeated Patterns in 2D Urban Images [pdf]
Juan Liu, Emmanouil Psarakis, Ioannis Stamos |
|
Abstract: Repeated patterns (such as windows, tiles, balconies and doors) are prominent and significant features in urban scenes. Therefore, detection of these repeated patterns becomes very important for city scene analysis. This paper attacks the problem of repeated pattern detection in a precise, efficient and automatic way, by combining traditional feature extraction with a Kronecker product low-rank modeling approach. Our method is tailored for 2D images of building façades. We have developed algorithms for automatic selection of a representative texture within façade images using vanishing points and Harris corners. After rectifying the input images, we describe novel algorithms that extract repeated patterns by using Kronecker product based modeling that rests on a solid theoretical foundation. Our approach is unique and has never been used for façade analysis. We have tested our algorithms on a large set of images.
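The Kronecker-product low-rank idea can be sketched with the classic rearrangement trick (Van Loan-Pitsianis): a matrix that is exactly a Kronecker product B ⊗ C becomes rank-1 after its blocks are unfolded into rows, so the best single Kronecker factor pair falls out of an SVD. The block sizes `(m, n, p, q)` must be supplied; this is a generic numerical sketch, not the paper's façade-specific pipeline.

```python
import numpy as np

def nearest_kronecker(A, m, n, p, q):
    """Best rank-1 Kronecker approximation A ~ B (x) C, where B is m x n
    and C is p x q: rearrange A's p x q blocks into the rows of R, then
    take R's top singular pair."""
    R = np.zeros((m * n, p * q))
    for i in range(m):
        for j in range(n):
            R[i * n + j] = A[i * p:(i + 1) * p, j * q:(j + 1) * q].ravel()
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    B = (np.sqrt(s[0]) * U[:, 0]).reshape(m, n)
    C = (np.sqrt(s[0]) * Vt[0]).reshape(p, q)
    return B, C
```

For a perfectly repetitive pattern, R is exactly rank 1 and the factorization is recovered up to scale; for a real façade the singular-value decay measures how well a Kronecker model fits.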
|
|
|
Bird Part Localization Using Exemplar-Based Models with Enforced Pose and Subcategory Consistency [pdf]
Jiongxin Liu, Peter N. Belhumeur |
|
Abstract: In this paper, we propose a novel approach for bird part localization, targeting fine-grained categories with wide variations in appearance due to different poses (including aspect and orientation) and subcategories. As it is challenging to represent such variations across a large set of diverse samples with tractable parametric models, we turn to individual exemplars. Specifically, we extend the exemplar-based models in [4] by enforcing pose and subcategory consistency at the parts. During training, we build pose-specific detectors scoring part poses across subcategories, and subcategory-specific detectors scoring part appearance across poses. At the testing stage, likely exemplars are matched to the image, suggesting part locations whose pose and subcategory consistency are well-supported by the image cues. From these hypotheses, part configuration can be predicted with very high accuracy. Experimental results demonstrate significant performance gains from our method on an extensive dataset: CUB-200-2011 [30], for both localization and classification tasks.
|
|
|
Abstract: Shaky stereoscopic video is not only unpleasant to watch but may also cause 3D fatigue. Stabilizing the left and right views of a stereoscopic video separately using a monocular stabilization method tends to both introduce undesirable vertical disparities and damage horizontal disparities, which may destroy the stereoscopic viewing experience. In this paper, we present a joint subspace stabilization method for stereoscopic video. We prove that the low-rank subspace constraint for monocular video [10] also holds for stereoscopic video. In particular, the feature trajectories from the left and right videos share the same subspace. Based on this proof, we develop a stereo subspace stabilization method that jointly computes a common subspace from the left and right videos and uses it to stabilize the two videos simultaneously. Our method meets the stereoscopic constraints without 3D reconstruction or explicit left-right correspondence. We test our method on a variety of stereoscopic videos with different scene content and camera motion. The experiments show that our method achieves high-quality stabilization for stereoscopic video in a robust and efficient way.
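The low-rank trajectory subspace at the heart of this family of methods can be illustrated with a minimal smoothing pipeline: factor the trajectory matrix, low-pass filter the time-varying coefficients, and recompose. Stacking left and right trajectories into one matrix `W` is what "joint" buys, since both views then share a single basis. The rank and moving-average kernel below are illustrative choices, not the paper's.

```python
import numpy as np

def smooth_trajectories(W, rank=9, kernel=9):
    """W is F x 2N: row f holds the (x, y) coordinates of N tracked points
    in frame f. Truncate to a low-rank basis (the subspace constraint),
    smooth each temporal coefficient with a moving average, recompose.
    For stereo, W would stack left- and right-view trajectories so both
    are stabilized against one common subspace."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = min(rank, len(s))
    coeff = U[:, :r] * s[:r]                    # F x r temporal coefficients
    k = np.ones(kernel) / kernel
    sm = np.column_stack([np.convolve(coeff[:, j], k, mode='same')
                          for j in range(r)])
    return sm @ Vt[:r]
```

Because the smoothing happens in the shared coefficient space rather than per view, corresponding left/right points move consistently, which is the property separate monocular stabilization loses.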
|
|
|
POP: Person Re-identification Post-rank Optimisation [pdf]
Chunxiao Liu, Chen Change Loy, Shaogang Gong, Guijin Wang |
|
Abstract: Owing to visual ambiguities and disparities, person re-identification methods inevitably produce suboptimal rank lists, which still require exhaustive human eyeballing to identify the correct target from hundreds of different likely candidates. Existing re-identification studies focus on improving the ranking performance, but rarely look into the critical problem of optimising the time-consuming and error-prone post-rank visual search at the user end. In this study, we present a novel one-shot Post-rank OPtimisation (POP) method, which allows a user to quickly refine their search by either one-shot or a couple of sparse negative selections during a re-identification process. We conduct systematic behavioural studies to understand users' searching behaviour and show that the proposed method allows correct re-identification to converge 2.6 times faster than the conventional exhaustive search. Importantly, through extensive evaluations we demonstrate that the method is capable of achieving significant improvement over state-of-the-art distance metric learning based ranking models, even with just one-shot feedback optimisation, by as much as over 30% performance improvement for rank-1 re-identification on the VIPeR and i-LIDS datasets.
|
|
|
Semantically-Based Human Scanpath Estimation with HMMs [pdf]
Huiying Liu, Dong Xu, Qingming Huang, Wen Li, Min Xu, Stephen Lin |
|
Abstract: We present a method for estimating human scanpaths, which are sequences of gaze shifts that follow visual attention over an image. In this work, scanpaths are modeled based on three principal factors that influence human attention, namely low-level feature saliency, spatial position, and semantic content. Low-level feature saliency is formulated as transition probabilities between different image regions based on feature differences. The effect of spatial position on gaze shifts is modeled as a Lévy flight with the shifts following a 2D Cauchy distribution. To account for semantic content, we propose to use a Hidden Markov Model (HMM) with a Bag-of-Visual-Words descriptor of image regions. An HMM is well-suited for this purpose in that 1) the hidden states, obtained by unsupervised learning, can represent latent semantic concepts, 2) the prior distribution of the hidden states describes visual attraction to the semantic concepts, and 3) the transition probabilities represent human gaze shift patterns. The proposed method is applied to task-driven viewing processes. Experiments and analysis performed on human eye gaze data verify the effectiveness of this method.
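The Lévy-flight component can be illustrated by drawing gaze-shift vectors with Cauchy-distributed lengths: most shifts are short, but the heavy tail produces the occasional long saccade. The isotropic construction below (uniform angle, half-Cauchy radius) is a simplifying assumption standing in for the paper's 2D Cauchy model.

```python
import numpy as np

def sample_gaze_shifts(n, gamma=1.0, seed=0):
    """Draw n gaze-shift vectors with heavy-tailed (Cauchy) lengths:
    a toy Levy-flight model of saccades. gamma sets the median length."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)     # isotropic direction
    # |tan(pi*(u - 0.5))| is half-Cauchy: P(length < gamma) = 0.5
    r = gamma * np.abs(np.tan(np.pi * (rng.uniform(size=n) - 0.5)))
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])
```

Unlike Gaussian shifts, the sampled lengths have a well-behaved median but no finite mean, which matches the mix of short fixational moves and rare long jumps the model is after.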
|
|
|
SGTD: Structure Gradient and Texture Decorrelating Regularization for Image Decomposition [pdf]
Qiegen Liu, Jianbo Liu, Pei Dong, Dong Liang |
|
Abstract: This paper presents a novel structure gradient and texture decorrelating regularization (SGTD) for image decomposition. The idea is motivated by the assumption that the structure gradient and texture components should be properly decorrelated for a successful decomposition. The proposed model consists of the data fidelity term, total variation regularization and the SGTD regularization. An augmented Lagrangian method is proposed to address this optimization problem, by first transforming the unconstrained problem into an equivalent constrained problem and then applying an alternating direction method to iteratively solve the subproblems. Experimental results demonstrate that the proposed method achieves performance better than or comparable to state-of-the-art methods.
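The augmented-Lagrangian / alternating-direction machinery can be seen on a small cousin of the problem: 1D total-variation denoising, min_x 0.5||x - b||^2 + lam*||Dx||_1. The splitting z = Dx, a quadratic x-update, a soft-thresholding z-update, and dual ascent on u form the same ADMM pattern the paper applies to its structure-gradient and texture terms; lam, rho and the iteration count are illustrative.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding, the proximal operator of t*|.|_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def tv_denoise_1d(b, lam=1.0, rho=1.0, iters=200):
    """ADMM for min_x 0.5*||x - b||^2 + lam*||D x||_1 with D the forward
    difference operator: the same augmented-Lagrangian / alternating-
    direction pattern, on a 1D toy problem."""
    n = len(b)
    D = np.diff(np.eye(n), axis=0)              # (n-1) x n differences
    A = np.eye(n) + rho * D.T @ D               # x-update system matrix
    z = np.zeros(n - 1)
    u = np.zeros(n - 1)                         # scaled dual variable
    for _ in range(iters):
        x = np.linalg.solve(A, b + rho * D.T @ (z - u))
        z = soft(D @ x + u, lam / rho)
        u = u + D @ x - z
    return x
```

Each subproblem is cheap (a linear solve and an elementwise shrinkage), which is exactly why the alternating-direction split is attractive for the full decomposition model.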
|
|
|
Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition [pdf]
Hans Lobel, Rene Vidal, Alvaro Soto |
|
Abstract: Currently, Bag-of-Visual-Words (BoVW) and part-based methods are the most popular approaches for visual recognition. In both cases, a mid-level representation is built on top of low-level image descriptors and top-level classifiers use this mid-level representation to achieve visual recognition. While in current part-based approaches, mid- and top-level representations are usually jointly trained, this is not the usual case for BoVW schemes. A main reason for this is the complex data association problem related to the usual large dictionary size needed by BoVW approaches. As a further observation, typical solutions based on BoVW and part-based representations are usually limited to extensions of binary classification schemes, a strategy that ignores relevant correlations among classes. In this work we propose a novel hierarchical approach to visual recognition based on a BoVW scheme that jointly learns suitable mid- and top-level representations. Furthermore, using a max-margin learning framework, the proposed approach directly handles the multiclass case at both levels of abstraction. We test our proposed method using several popular benchmark datasets. As our main result, we demonstrate that, by coupling learning of mid- and top-level representations, the proposed approach fosters sharing of discriminative visual words among target classes, being able to achieve state-of-the-art recognition performance using far fewer visual words than previous approaches.
|
|
|
Two-Point Gait: Decoupling Gait from Body Shape [pdf]
Stephen Lombardi, Ko Nishino, Yasushi Makihara, Yasushi Yagi |
|
Abstract: Human gait modeling (e.g., for person identification) largely relies on image-based representations that muddle gait with body shape. Silhouettes, for instance, inherently entangle body shape and gait. For gait analysis and recognition, decoupling these two factors is desirable. Most important, once decoupled, they can be combined for the task at hand, but not if left entangled in the first place. In this paper, we introduce Two-Point Gait, a gait representation that encodes the limb motions regardless of the body shape. Two-Point Gait is directly computed on the image sequence based on the two-point statistics of optical flow fields. We demonstrate its use for exploring the space of human gait and gait recognition under large clothing variation. The results show that we can achieve state-of-the-art person recognition accuracy on a challenging dataset.
|
|
|
Active Visual Recognition with Expertise Estimation in Crowdsourcing [pdf]
Chengjiang Long, Gang Hua, Ashish Kapoor |
|
Abstract: We present a noise-resilient probabilistic model for active learning of a Gaussian process classifier from crowds, i.e., a set of noisy labelers. It explicitly models both the overall label noise and the expertise level of each individual labeler in two levels of flip models. Expectation propagation is adopted for efficient approximate Bayesian inference of our probabilistic model for classification, based on which a generalized EM algorithm is derived to estimate both the global label noise and the expertise of each individual labeler. The probabilistic nature of our model immediately allows the adoption of the prediction entropy and estimated expertise for active selection of data samples to be labeled, and active selection of high quality labelers to label the data, respectively. We apply the proposed model to three visual recognition tasks, i.e., object category recognition, gender recognition, and multi-modal activity recognition, on three datasets with real crowd-sourced labels from Amazon Mechanical Turk. The experiments clearly demonstrate the efficacy of the proposed model.
|
|
|
Transfer Feature Learning with Joint Distribution Adaptation [pdf]
Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, Philip S. Yu |
|
Abstract: Transfer learning is established as an effective technology in computer vision for leveraging rich labeled data in the source domain to build an accurate classifier for the target domain. However, most prior methods have not simultaneously reduced the difference in both the marginal distribution and conditional distribution between domains. In this paper, we put forward a novel transfer learning approach, referred to as Joint Distribution Adaptation (JDA). Specifically, JDA aims to jointly adapt both the marginal distribution and conditional distribution in a principled dimensionality reduction procedure, and construct a new feature representation that is effective and robust for substantial distribution differences. Extensive experiments verify that JDA can significantly outperform several state-of-the-art methods on four types of cross-domain image classification problems.
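The statistic this kind of adaptation drives down can be sketched with a linear-kernel Maximum Mean Discrepancy: a marginal term compares whole-domain means, and a conditional term compares per-class means using pseudo labels on the target (which JDA refines iteratively). This sketch shows only the objective's flavour; the actual method embeds it inside learning a dimensionality-reducing projection.

```python
import numpy as np

def mmd2(Xs, Xt):
    """Squared MMD with a linear kernel: squared distance between the
    two sample means. A zero value means matched first moments."""
    return float(((Xs.mean(0) - Xt.mean(0)) ** 2).sum())

def joint_mmd(Xs, ys, Xt, yt_pseudo):
    """Marginal MMD plus per-class conditional MMD, the joint
    marginal + conditional criterion in the spirit of JDA
    (target labels are pseudo labels)."""
    total = mmd2(Xs, Xt)
    for c in np.unique(ys):
        s, t = Xs[ys == c], Xt[yt_pseudo == c]
        if len(s) and len(t):
            total += mmd2(s, t)
    return total
```

A projection that shrinks this quantity aligns both where the domains sit overall and where each class sits within them.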
|
|
|
Abstract: Regression-based techniques have shown promising results for people counting in crowded scenes. However, most existing techniques require expensive and laborious data annotation for model training. In this study, we propose to address this problem from three perspectives: (1) Instead of exhaustively annotating every single frame, the most informative frames are selected for annotation automatically and actively. (2) Rather than learning from only labelled data, the abundant unlabelled data are exploited. (3) Labelled data from other scenes are employed to further alleviate the burden of data annotation. All three ideas are implemented in a unified active and semi-supervised regression framework with the ability to perform transfer learning, by exploiting the underlying geometric structure of crowd patterns via manifold analysis. Extensive experiments validate the effectiveness of our approach.
|
|
|
Abstract: Speedy abnormal event detection meets the growing demand to process an enormous number of surveillance videos. Based on the inherent redundancy of video structures, we propose an efficient sparse combination learning framework. It achieves decent performance in the detection phase without compromising result quality. The short running time is guaranteed because the new method effectively turns the original complicated problem into one in which only a few costless small-scale least squares optimization steps are involved. Our method reaches high detection rates on benchmark datasets at a speed of 140-150 frames per second on average when computing on an ordinary desktop PC using MATLAB.
|
|
|
Correlation Adaptive Subspace Segmentation by Trace Lasso [pdf]
Canyi Lu, Jiashi Feng, Zhouchen Lin, Shuicheng Yan |
|
Abstract: This paper studies the subspace segmentation problem. Given a set of data points drawn from a union of subspaces, the goal is to partition them into the underlying subspaces they were drawn from. The spectral clustering method is used as the framework. It requires finding an affinity matrix which is close to block diagonal, with nonzero entries corresponding to pairs of data points from the same subspace. In this work, we argue that both sparsity and the grouping effect are important for subspace segmentation. A sparse affinity matrix tends to be block diagonal, with fewer connections between data points from different subspaces. The grouping effect ensures that highly correlated data, which are usually from the same subspace, can be grouped together. Sparse Subspace Clustering (SSC), by using l1-minimization, encourages sparsity for data selection, but it lacks the grouping effect. On the contrary, Low-Rank Representation (LRR), by rank minimization, and Least Squares Regression (LSR), by l2-regularization, exhibit a strong grouping effect, but they fall short in subset selection. Thus the obtained affinity matrix is usually very sparse for SSC, yet very dense for LRR and LSR.
In this work, we propose the Correlation Adaptive Subspace Segmentation (CASS) method by using trace Lasso. CASS is a data-correlation-dependent method which simultaneously performs automatic data selection and groups correlated data together. It can be regarded as a method which adap
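The trace Lasso itself is easy to state numerically: it is the nuclear norm of X Diag(w), and it interpolates between the l1 norm of w (orthonormal columns of X) and the l2 norm (identical columns), which is exactly the correlation-adaptive behaviour CASS exploits. A direct sketch:

```python
import numpy as np

def trace_lasso(X, w):
    """Trace Lasso ||X Diag(w)||_* (sum of singular values). With
    orthonormal columns of X it equals ||w||_1; with identical (fully
    correlated) columns it collapses to ||w||_2."""
    return float(np.linalg.svd(X @ np.diag(w), compute_uv=False).sum())
```

Between these two extremes, correlated groups of columns are penalised more like l2 (kept together) and uncorrelated ones more like l1 (selected sparsely).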
|
|
|
Correntropy Induced L2 Graph for Robust Subspace Clustering [pdf]
Canyi Lu, Jinhui Tang, Min Lin, Liang Lin, Shuicheng Yan, Zhouchen Lin |
|
Abstract: In this paper, we study the robust subspace clustering problem, which aims to cluster possibly noisy data points into their underlying subspaces. A large pool of previous subspace clustering methods focus on graph construction by different regularizations of the representation coefficients. We instead focus on the robustness of the model to non-Gaussian noise. We propose a new robust clustering method by using the correntropy induced metric, which is robust for handling non-Gaussian and impulsive noise. We further extend the method to handle data with outlier rows/features. The multiplicative form of half-quadratic optimization is used to optimize the non-convex correntropy objective function of the proposed models. Extensive experiments on face datasets demonstrate that the proposed methods are more robust to corruptions and occlusions.
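The correntropy induced metric (CIM) replaces the squared error with a bounded Gaussian-kernel similarity, so one wildly corrupted coordinate contributes at most a fixed amount rather than dominating the distance, which is the robustness argument above. A minimal sketch (the kernel width `sigma` is an assumed free parameter):

```python
import numpy as np

def correntropy_metric(x, y, sigma=1.0):
    """Correntropy induced metric: each coordinate contributes
    1 - exp(-(xi - yi)^2 / (2 sigma^2)), which saturates at 1, so a
    single huge outlier cannot dominate as it would under the l2 norm."""
    k = np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))
    return float(np.sqrt(np.mean(1.0 - k)))
```

With one coordinate corrupted by an arbitrarily large value, the CIM stays bounded while the l2 distance blows up.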
|
|
|
Abstract: When face images are taken in the wild, the large variations in facial pose, illumination, and expression make face recognition challenging. The most fundamental problem in face recognition is to measure the similarity between faces. Traditional measurements such as various mathematical norms, the Hausdorff distance, and approximate geodesic distance cannot accurately capture the structural information between faces in such complex circumstances. To address this issue, we develop a novel face patch network, based on which we define a new similarity measure called the random path (RP) measure. The RP measure is derived from the collective similarity of paths obtained by performing random walks in the network. It can globally characterize the contextual and curved structures of the face space. To apply the RP measure, we construct two kinds of networks: the in-face network and the out-face network. The in-face network is drawn from any two face images and captures the local structural information. The out-face network is constructed from all the training face patches, thereby modeling the global structures of the face space. The two face networks are structurally complementary and can be combined to improve the recognition performance. Experiments on the Multi-PIE and LFW benchmarks show that the RP measure outperforms most state-of-the-art algorithms for face recognition.
|
|
|
Image Set Classification Using Holistic Multiple Order Statistics Features and Localized Multi-kernel Metric Learning [pdf]
Jiwen Lu, Gang Wang, Pierre Moulin |
|
Abstract: This paper presents a new approach for image set classification, where each training and testing example contains a set of image instances of an object captured from varying viewpoints or under varying illuminations. While a number of image set classification methods have been proposed in recent years, most of them model each image set as a single linear subspace or a mixture of linear subspaces, which may lose discriminative information for classification. To address this, we propose exploring multiple order statistics as features of image sets, and develop a localized multi-kernel metric learning (LMKML) algorithm to effectively combine the different order statistics for classification. Our method achieves state-of-the-art performance on four widely used databases: the Honda/UCSD, CMU Mobo, and YouTube face datasets, and the ETH-80 object dataset.
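A minimal version of a "multiple order statistics" descriptor: concatenate the set's mean, flattened covariance, and a cheap per-dimension third-order moment. The paper combines such statistics through localized multi-kernel metric learning; here they are simply stacked into one vector, and the elementwise skew term is a stand-in assumption for a full higher-order statistic.

```python
import numpy as np

def set_statistics(X):
    """X is an image set: N samples x D features. Returns first-, second-
    and (elementwise) third-order statistics concatenated into a single
    holistic descriptor of length D + D*D + D."""
    mu = X.mean(0)
    Xc = X - mu
    cov = (Xc.T @ Xc) / max(len(X) - 1, 1)
    third = (Xc ** 3).mean(0)       # per-dimension 3rd central moment
    return np.concatenate([mu, cov.ravel(), third])
```

Each statistic captures structure a single linear subspace model discards: the mean locates the set, the covariance describes its spread, and the third-order term its asymmetry.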
|
|
|
A Deep Sum-Product Architecture for Robust Facial Attributes Analysis [pdf]
Ping Luo, Xiaogang Wang, Xiaoou Tang |
|
Abstract: Recent works have shown that facial attributes are useful in a number of applications such as face recognition and retrieval. However, estimating attributes in images with large variations remains a big challenge. This challenge is addressed in this paper. Unlike existing methods that assume the independence of attributes during their estimation, our approach captures the interdependencies of local regions for each attribute, as well as the high-order correlations between different attributes, which makes it more robust to occlusions and misdetection of face regions. First, we model region interdependencies with a discriminative decision tree, where each node consists of a detector and a classifier trained on a local region. The detector allows us to locate the region, while the classifier determines the presence or absence of an attribute. Second, correlations of attributes and attribute predictors are modeled by organizing all of the decision trees into a large sum-product network (SPN), which is learned by the EM algorithm and yields the most probable explanation (MPE) of the facial attributes in terms of the regions' localization and classification. Experimental results on a large data set with 22,400 images show the effectiveness of the proposed approach.
|
|
|
Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps [pdf]
Jiajia Luo, Wei Wang, Hairong Qi |
|
Abstract: Human action recognition based on the depth information provided by commodity depth sensors is an important yet challenging task. Noisy depth maps, different lengths of action sequences, and free styles in performing actions may cause large intra-class variations. In this paper, a new framework based on sparse coding and temporal pyramid matching (TPM) is proposed for depth-based human action recognition. In particular, a discriminative class-specific dictionary learning algorithm is proposed for sparse coding. By adding group sparsity and geometry constraints, features can be well reconstructed by the sub-dictionary belonging to the same class, and the geometric relationships among features are also kept in the calculated coefficients. The proposed approach is evaluated on two benchmark datasets captured by depth cameras. Experimental results show that the proposed algorithm repeatedly achieves performance superior to state-of-the-art algorithms. Moreover, the proposed dictionary learning method also outperforms classic dictionary learning approaches.
|
|
|
Abstract: We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with a DDN, which is able to accurately estimate complex pose variations with good robustness to occlusions and background clutter. The DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. The occlusion estimation layers estimate a binary mask, indicating which part of a pedestrian is invisible. The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. The decomposition layers directly transform the synthesized visual features into label maps. We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using stochastic gradient descent. Experimental results show that our approach achieves better segmentation accuracy than state-of-the-art methods on pedestrian images with or without occlusions. Another important contribution of this paper is that it provides a large-scale benchmark human parsing dataset that includes 3,673 annotated samples collected from 171 surveillance videos. It is 20 times large
|
|
|
A Method of Perceptual-Based Shape Decomposition [pdf]
Chang Ma, Zhongqian Dong, Tingting Jiang, Yizhou Wang, Wen Gao |
|
Abstract: In this paper, we propose a novel perception-based shape decomposition method which aims to decompose a shape into semantically meaningful parts. In addition to three popular perception rules (the Minima rule, the Short-cut rule and the Convexity rule) in shape decomposition, we propose a new rule, named the part-similarity rule, to encourage consistent partition of similar parts. The problem is formulated as a quadratically constrained quadratic program (QCQP) and is solved by a trust-region method. Experimental results on the MPEG-7 dataset show that we obtain a shape decomposition more consistent with human perception than other state-of-the-art methods, both qualitatively and quantitatively. Finally, we show the advantage of semantic parts over non-meaningful parts in object detection on the ETHZ dataset.
|
|
|
Action Recognition and Localization by Hierarchical Space-Time Segments [pdf]
Shugao Ma, Jianming Zhang, Nazli Ikizler-Cinbis, Stan Sclaroff |
|
Abstract: We propose Hierarchical Space-Time Segments as a new representation for action recognition and localization. This representation has a two-level hierarchy. The first level comprises the root space-time segments that may contain a human body. The second level comprises multi-grained space-time segments that contain parts of the root. We present an unsupervised method to generate this representation from video, which extracts both static and non-static relevant space-time segments, and also preserves their hierarchical and temporal relationships. Using a simple linear SVM on the resultant bag of hierarchical space-time segments representation, we attain better than, or comparable to, state-of-the-art action recognition performance on two challenging benchmark datasets and at the same time produce good action localization results.
|
|
|
Constant Time Weighted Median Filtering for Stereo Matching and Beyond [pdf]
Ziyang Ma, Kaiming He, Yichen Wei, Jian Sun, Enhua Wu |
|
Abstract: Despite the continuous advances in local stereo matching for years, most efforts are on developing robust cost computation and aggregation methods. Little attention has been seriously paid to the disparity refinement.
In this work, we study weighted median filtering for disparity refinement. We discover that with this refinement, even the simple box filter aggregation achieves comparable accuracy with various sophisticated aggregation methods (with the same refinement). This is due to the nice properties of weighted median filtering in removing outlier error while respecting edges/structures. This reveals that the previously overlooked refinement can be at least as crucial as aggregation. We also develop the first constant time algorithm for the previously time-consuming weighted median filter. This makes the simple combination of box aggregation + weighted median an attractive solution in practice for both speed and accuracy.
As a byproduct, the fast weighted median filtering unleashes its potential in other applications that were hampered by high complexities. We show its superiority in various applications such as depth upsampling, clip-art JPEG artifact removal, and image stylization.
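The core operator is simple to state: the weighted median of a window is the smallest value at which the cumulative weight reaches half the total weight, with weights typically taken from a guidance image so that edges are respected. A direct (non-constant-time) reference implementation, useful for checking any fast version:

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median: the smallest v such that the cumulative weight of
    values <= v reaches half the total weight. With equal weights this is
    the ordinary median; guidance-derived weights make it edge-aware."""
    order = np.argsort(values)
    cw = np.cumsum(weights[order])
    return values[order][np.searchsorted(cw, 0.5 * cw[-1])]
```

Applied per pixel over a local window of a disparity map, this removes outlier disparities (like an ordinary median) while large guidance weights on same-side pixels keep depth edges sharp.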
|
|
|
Domain Transfer Support Vector Ranking for Person Re-identification without Target Camera Label Information [pdf]
Andy J. Ma, Pong C. Yuen, Jiawei Li |
|
Abstract: This paper addresses a new person re-identification problem without label information of persons under non-overlapping target cameras. Given the matched (positive) and unmatched (negative) image pairs from source domain cameras, as well as unmatched (negative) image pairs which can be easily generated from target domain cameras, we propose a Domain Transfer Ranked Support Vector Machines (DTRSVM) method for re-identification under target domain cameras. To overcome the problems introduced by the absence of matched (positive) image pairs in the target domain, we relax the discriminative constraint to a necessary condition relying only on the positive mean in the target domain. By estimating the target positive mean using source and target domain data, a new discriminative model with high confidence in the target positive mean and low confidence in target negative image pairs is developed. Since the necessary condition may not truly preserve the discriminability, multi-task support vector ranking is proposed to incorporate the training data from the source domain with label information. Experimental results show that the proposed DTRSVM outperforms existing methods without using label information in target cameras, and the top-30 rank accuracy can be improved by the proposed method by up to 9.40% on publicly available person re-identification datasets.
|
|
|
Latent Multitask Learning for View-Invariant Action Recognition [pdf]
Behrooz Mahasseni, Sinisa Todorovic |
|
Abstract: This paper presents an approach to view-invariant action recognition, where human poses and motions exhibit large variations across different camera viewpoints. When each viewpoint of a given set of action classes is specified as a learning task, multitask learning appears suitable for achieving view invariance in recognition. We extend standard multitask learning to allow identifying: (1) latent groupings of action views (i.e., tasks), and (2) discriminative action parts, along with joint learning of all tasks. This is because it seems reasonable to expect that certain distinct views are more correlated than others, and thus identifying correlated views could improve recognition. Also, part-based modeling is expected to improve robustness against self-occlusion when actors are imaged from different views. Results on the benchmark datasets show that we outperform standard multitask learning by 21.9%, and the state-of-the-art alternatives by 4.56%.
|
|
|
Progressive Multigrid Eigensolvers for Multiscale Spectral Segmentation [pdf]
Michael Maire, Stella X. Yu |
|
Abstract: We reexamine the role of multiscale cues in image segmentation using an architecture that constructs a globally coherent scale-space output representation. This characteristic is in contrast to many existing works on bottom-up segmentation, which prematurely compress information into a single scale. The architecture is a standard extension of Normalized Cuts from an image plane to an image pyramid, with cross-scale constraints enforcing consistency in the solution while allowing emergence of coarse-to-fine detail.
We observe that multiscale processing, in addition to improving segmentation quality, offers a route by which to speed computation. We make a significant algorithmic advance in the form of a custom multigrid eigensolver for constrained Angular Embedding problems possessing coarse-to-fine structure. Multiscale Normalized Cuts is a special case. Our solver builds atop recent results on randomized matrix approximation, using a novel interpolation operation to mold its computational strategy according to cross-scale constraints in the problem definition. Applying our solver to multiscale segmentation problems demonstrates speedup by more than an order of magnitude. This speedup is at the algorithmic level and carries over to any implementation target.
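In its two-way, single-scale form, the eigenproblem such solvers accelerate is the classic Normalized Cuts generalized eigenproblem (D - W)x = lambda * D x; a dense-matrix sketch on a toy affinity shows the shape of the computation that the multigrid solver speeds up across an image pyramid.

```python
import numpy as np

def normalized_cut(W):
    """Two-way Normalized Cuts on a symmetric affinity matrix W:
    threshold the second-smallest generalized eigenvector of
    (D - W) x = lambda D x, solved via the symmetric normalized
    Laplacian D^{-1/2} (D - W) D^{-1/2}."""
    d = W.sum(1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - W
    vals, vecs = np.linalg.eigh(Dinv @ L @ Dinv)   # ascending eigenvalues
    fiedler = Dinv @ vecs[:, 1]                    # back to generalized problem
    return fiedler > np.median(fiedler)
```

A dense `eigh` scales cubically, which is exactly why multigrid and randomized approximations matter once W comes from all pixels of an image pyramid.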
|
|