| 1 |
|
Integration of Segmentation and Stereo Matching
|
[PDF]
|
| 2 |
|
Watch me if you can!
This project explored the application of computer vision technology in a social game, with the aim of developing a web-based photo-bombing game for the HITLab NZ. I conducted a literature review to learn more about game design and computer vision methods. Subjects covered included the psychological background of pranks and social games. My technology research focused on computer vision methods for recognising players in uploaded photos. Furthermore, I used the outcomes of a context mapping session with four potential users to develop an interaction vision and define requirements for a successful game. My main conclusions were that social gameplay in particular was a major opportunity for my game, and that the various computer vision methods each have benefits and disadvantages that should be explored further before incorporating one in my design. This exploration was carried out using an evolutionary design method and a design framework based on several elements of game design. Brainstorming sessions and a morphological chart led to game mechanisms that explored the idea scenarios regarding social play and the fulfilment of quests, and to a technological development based on QR codes and face recognition. The final design combined both idea scenarios by offering social quests, and used face recognition as the identification method. The game was developed and explained in more detail using the game design framework and various diagrams: a function diagram explaining the input, output, and processes of player, website, and servers; and a system diagram connecting the functions and interactions to form a player-friendly interface. Because of technological limitations regarding player identification found during game development, the final user test focused on the gameplay and user experience compared to my vision and expectations. I tested the enjoyment level, balance, player retention, and tourist reactions.
I found that my test participants were reluctant to cross social boundaries for more elaborate photo-bombing quests. They felt that, in theory, the game sounded like an amazing way to spend some spare time on holiday, almost like a treasure hunt. In practice, however, they hardly got round to playing it. My final recommendations were to review the choice of a broad target group and to pay close attention to the implementation of the game, so that it becomes socially accepted. More importantly, the success of the game depends heavily on future computer vision developments: as it stands right now, the methods I explored are not reliable enough.
|
[Abstract]
|
| 3 |
|
Single person pose recognition and tracking
The goal of this research is to improve a system capable of detecting and tracking a single person and recognizing poses in real time for controlling a spatial game. After background subtraction, the human blob is segmented in order to track the torso and hands. Angles and distances between the hands and the torso center are used to compute the features. Finally, a 10-Nearest-Neighbor classifier recognizes 9 predefined Poses, which the player uses to control the game. This work contributes two improvements. The first is a more robust hand detection that combines the current skin color detection with human blob information. The second is a classifier that recognizes Non-Poses in addition to the 9 predefined Poses that are used in the game.
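The feature and classification steps described above can be sketched roughly as follows. This is a minimal illustration, not the system from the abstract: the function names, the pose labels, and in particular the distance-threshold rule for rejecting Non-Poses are assumptions made for the example, since the abstract does not say how Non-Poses are recognized.

```python
import math
from collections import Counter


def pose_features(torso, left_hand, right_hand):
    """Angle (radians) and distance of each hand relative to the torso center."""
    feats = []
    for hx, hy in (left_hand, right_hand):
        dx, dy = hx - torso[0], hy - torso[1]
        feats += [math.atan2(dy, dx), math.hypot(dx, dy)]
    return feats


def classify_pose(sample, training, k=10, max_mean_dist=None):
    """Majority vote among the k nearest training samples.

    training: list of (feature_vector, label) pairs.
    max_mean_dist: hypothetical rejection threshold; if the mean distance
    to the k neighbours exceeds it, the sample is labelled "non-pose".
    Note: plain Euclidean distance ignores angle wrap-around at +/- pi.
    """
    nearest = sorted((math.dist(sample, f), label) for f, label in training)[:k]
    if max_mean_dist is not None:
        mean_dist = sum(d for d, _ in nearest) / len(nearest)
        if mean_dist > max_mean_dist:
            return "non-pose"
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

A sample whose hands lie far from every training pose is then rejected instead of being forced into one of the predefined classes.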
|
[PDF]
[Abstract]
|
| 4 |
|
An intelligent camera system for the Healthcare
Objective: Injuries caused by falls of elderly people are a common worldwide problem, and the ageing of the population will further increase the related burdens and costs. Recent technology using active monitoring systems has proven successful in analyzing human actions. What is lacking in this research is implementation in real elderly home environments. Most healthcare research focuses on the detection of falls and not on the detection of normal daily actions. We present a single camera with a fisheye lens that is capable of monitoring an entire room. Using only one camera reduces costs and the computational burden, which results in a real-time system. While various studies address the detection of such actions, none uses real data from elderly people in their own living environment. Using such data increases the difficulty of action recognition, because every living environment has different settings and noise factors.
Main: We developed an action detection system that monitors the actions of elderly people in their homes during normal daily activities, with the idea of raising the alarm in case of danger. Our system is equipped with a single wide-angle camera mounted on the ceiling of an elderly home. This gives a top-view image of the environment, resulting in a clear map of household objects without any occlusions. The main idea is to monitor the motion information of elderly people and to model actions as a change of motion or poses in time that leads to a specific action. After background subtraction using Gaussian Mixture Models, the motion information is extracted using the Motion History Images method and analyzed to detect important actions. We propose to model actions as the shape deformations of the motion history image in time. Every action is defined at several moments in time, called “Action peaks”, using different features: the holistic area, contour and location measurements, as well as the Fourier shape descriptors. We combine all the measurements in a Bag of Words model and create unique action representations called “Action Signatures”. These action signatures are then transformed and combined using feature fusion in order to learn the optimal combination of features for each action. Learning the optimal feature fusion is performed using Support Vector Machines. The final trained system is used to classify each new action.
Results: The results are divided into two parts. First, scientific data is used that was recorded in a test room simulating an elderly home, with colleagues and students. We recorded and detected multiple actions (Bending, Walking, Falling, Collapsing), all with very high accuracy rates, above 93%. Second, real data was recorded in real elderly homes, observing 4 elderly people. Different actions were monitored: Walking, Sitting, Open Door, and Eating. Results in a real environment show high detection rates and demonstrate that the system is able to detect multiple human actions using only a single camera.
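As a rough illustration of the Motion History Image step, the sketch below uses the commonly cited update rule in which moving pixels are set to a duration `tau` and all other pixels decay by one per frame. It assumes the binary motion mask has already been produced by background subtraction; the function names and the two holistic features are chosen for this example and are not taken from the described system.

```python
import numpy as np


def update_mhi(mhi, motion_mask, tau):
    """One MHI update: moving pixels get tau, the rest decay by 1 (floored at 0)."""
    return np.where(motion_mask, tau, np.maximum(mhi - 1, 0))


def mhi_features(mhi):
    """Two simple holistic measurements: area and centroid of the active region."""
    rows, cols = np.nonzero(mhi)
    area = rows.size
    if area == 0:
        return 0, (float("nan"), float("nan"))
    return area, (rows.mean(), cols.mean())
```

Tracking how such measurements change from frame to frame gives one simple view of the shape deformation of the motion history image over time.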
|
[PDF]
[Abstract]
|
| 5 |
|
A Knowledge-Intensive Approach to Computer Vision Systems
This thesis focusses on the modelling of knowledge-intensive computer vision tasks. Knowledge-intensive tasks are tasks that require a high level of expert knowledge to be performed successfully. Such tasks are generally performed by a task expert. Task experts have a lot of experience in performing their task and can be a valuable source of information for the automation of the task. We propose a framework for creating a white-box ontology-based computer vision application.
White-box methods have the property that the internal workings of the system are known and transparent. They can be understood in terms of the task domain. An application that is based on explicit expert knowledge has a number of inherent advantages, including corrigibility, adaptability, robustness, and reliability. We propose a design method for developing white-box computer vision applications that consists of the following steps: (i) define the scope of the task and the purpose of the application, (ii) decompose the task into subtasks, (iii) define and refine application ontologies that contain the descriptive knowledge of the expert, (iv) identify computational components, (v) specify explicit procedural knowledge rules, and (vi) implement algorithms required by the procedural knowledge.
The scope is one of the cornerstones of the application, since it sets the boundaries of the task. The problem owner and the domain experts are together responsible for setting the scope and defining the purpose. Scope and purpose are important for the task decomposition and for the specification of the application ontologies. The scope and purpose help the domain engineer to keep focus in creating dedicated ontologies for the application.
The decomposition of the task into subtasks models the domain expert’s “observe – interpret – assess” way of performing a visual inspection task. This decomposition leads to a generic framework of subtasks alternated with application ontologies. The list of consecutive subtasks – record object, find structures, identify object parts, determine parameters, determine quality – can be reused for any visual inspection task.
Application ontologies are task-specific ontologies containing the descriptive knowledge relevant for the task. We have described an interview-based knowledge acquisition method that is suited for modelling multi-domain, multi-expert task-specific ontologies. Using the knowledge of multiple experts leads to a rich application ontology; adding an outsider’s perspective from domain experts from other involved domains leads to an expression of knowledge that may be too trivial for task experts to mention or may not be part of the usual perspective of the task experts.
Knowledge acquisition based only on interviews and observations has some disadvantages. It takes a lot of modelling time for both the domain expert and the knowledge engineer, it is difficult for the knowledge engineer to give a structured and full overview of his knowledge, and a model is created from scratch, even though reusable sources may exist. We have therefore introduced a reuse-based ontology construction component that gives the domain expert a more prominent and active role in the knowledge acquisition process. This component prompts the domain expert with terms from existing knowledge sources to help him create a full overview of his knowledge. We show that this method is an efficient way to obtain a semi-formal description of the domain knowledge.
With the decomposition of the knowledge-intensive task into subtasks interspersed with descriptive knowledge models completed, we focus on the subtasks. Each of these subtasks can be represented by a sequence of components that perform a clearly defined part of a task. To specify these components, we explicitly identify for each service in the computational workflow (i) the input concepts, (ii) the output concepts, and (iii) a human readable (high level) description of the service. This information is used as documentation for the procedural knowledge.
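One way to picture such a component specification is as a small record holding the three items named above. The sketch below is hypothetical: the class name, the concept names, and the consistency check are illustrative assumptions, not part of the described framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Documentation record for one service in a computational workflow."""
    name: str
    inputs: frozenset       # input concepts (ontology terms)
    outputs: frozenset      # output concepts (ontology terms)
    description: str        # human-readable, high-level description


def missing_inputs(workflow, initial_concepts):
    """Return {service name: input concepts not yet available when it runs}."""
    available = set(initial_concepts)
    problems = {}
    for service in workflow:
        missing = service.inputs - available
        if missing:
            problems[service.name] = missing
        available |= service.outputs
    return problems
```

A check of this kind makes explicit whether each service in a sequence only consumes concepts that earlier services (or the initial recording) have produced.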
Besides transparency of descriptive knowledge, transparency of processing knowledge is a desirable feature of a knowledge-intensive computer vision system. We show that blindly embedding software components in a transparent way may have an adverse effect. In some cases, transparency is not useful or desired. To help the software developer make a balanced decision on whether transparency is called for, we have proposed a set of decision criteria: availability of expertise, application range of a component, triviality, explanation, and availability of third-party expertise. These decision criteria are paired with means of adding transparency to an application. We have elaborated several examples from the horticultural case study to show which transparency decisions are made for which reasons.
Using the framework for designing knowledge-intensive computer vision applications, we have implemented a prototype system to automatically assess the quality of tomato seedlings. We have shown that the proposed design method indeed results in a white-box system that has adaptability, corrigibility, reliability and robustness as properties. We provide guidelines on how to implement tool support for the adaptability and corrigibility properties of the system, to better assist the end users of the application. Moreover, we show how organisational learning and building trust in the system are supported by the white-box setup of the computer vision application.
|
[PDF]
[Abstract]
|
| 6 |
|
Crowd control by multiple cameras
One of the goals of the crowd control project at Delft University of Technology is to detect and track people during a crisis event, classify their behavior and assess what is happening. The assumption is that the crisis area is observed by multiple cameras (fixed or mobile). The cameras sense the environment and extract features such as the amount of motion. These features are the input to a Bayesian network with nodes corresponding to situations such as terrorist attack, fire, and explosion. Given the probabilities of the observed features, the likelihood of the possible situations can be computed by reasoning over the network. A prototype was tested in a train compartment and its environment. Forty scenarios, performed by actors, were recorded. From the recordings the conditional probabilities have been computed. The scenarios were designed as scripts, which proved to be a good methodology. The models, experiments and results will be presented in the paper.
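The reasoning step can be illustrated with the simplest possible network, in which each observed feature depends only on the situation. That structure is an assumption made for this sketch, as are the function name and any numbers passed to it; the abstract does not describe the actual network topology or probabilities.

```python
def situation_posterior(priors, likelihoods, observed):
    """Posterior over situations by Bayes' rule, assuming features are
    conditionally independent given the situation (a simplifying assumption).

    priors:      {situation: P(situation)}
    likelihoods: {situation: {feature: {value: P(value | situation)}}}
    observed:    {feature: value}
    """
    scores = {}
    for situation, prior in priors.items():
        p = prior
        for feature, value in observed.items():
            p *= likelihoods[situation][feature][value]
        scores[situation] = p
    total = sum(scores.values())
    if total == 0:
        raise ValueError("observation has zero probability under every situation")
    return {situation: p / total for situation, p in scores.items()}
```

In such a setup the conditional probability tables would be the quantities estimated from recorded scenarios, and the priors express how likely each situation is before any feature is observed.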
|
[PDF]
[Abstract]
|
| 7 |
|
Sensor fusion in head pose tracking
The focus of this thesis is on studying diverse techniques, methods and sensors for position and orientation determination, with application to augmented reality.
In Chapter 2 we reviewed a variety of existing techniques and systems for position determination. From a practical point of view, we discussed the need for a mobile system to localize itself while navigating through an environment. We identified two different localization instantiations, position tracking and global localization. In order to determine what information a mobile system has access to regarding its position, we discussed different sources of information and pointed out advantages and disadvantages. We concluded that, because noise sources make actuators and sensors imperfect, a navigating mobile system should localize itself using information from different sensors.
In Chapter 3, based on the analysis of the technologies presented in Chapter 2 and the sensors described in this chapter, we selected a set of sensors from which to acquire and fuse the data in order to achieve the required robustness and accuracy. For the inertial system we selected three accelerometers (ADXL105) and three gyroscopes (Murata ENC05). To correct for gyro drift we use a TCM2 sensor that contains a two-axis inclinometer and a three-axis magnetometer (compass). Indoors we use a Firewire webcam to obtain the position and orientation information. Outdoors we use, in addition, a GPS receiver in combination with a radio data system (RDS) receiver to obtain DGPS correction information.
Chapter 4 was concerned with development of inertial equations required for the navigation of a mobile system. To understand the effect of error propagation, the inertial equations were linearized. In this chapter we decompose the localization problem into attitude estimation and, subsequently, position estimation. We focus on obtaining a good attitude estimate without building a model of the vehicle dynamics. The dynamic model was replaced by gyro modeling. An Indirect (error state) Kalman filter that optimally incorporates inertial navigation and absolute measurements was developed for this purpose. The linear form of the system and measurement equations for the planar case derived here allowed us to examine the role of the Kalman filter as a signal processing unit. The extension of this formulation to the 3D case shows the same benefits. A tracking example in the 3D case was also shown in this chapter.
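To make the error-state idea concrete, the sketch below treats a planar heading only: a gyro is integrated to propagate the heading, and a Kalman filter estimates the heading error and the residual gyro bias from absolute compass readings, then feeds both corrections back. The function name, the noise parameters, and the closed-loop feedback form are assumptions made for this example, not the filter developed in the thesis.

```python
import numpy as np


def run_error_state_filter(gyro, compass, dt, q_angle=1e-6, q_bias=1e-6, r=1e-2):
    """Planar indirect (error-state) Kalman filter sketch.

    Error state: [heading error, gyro bias error]. After every update the
    estimated errors are fed back into the heading and bias, so the
    predicted error state is always zero. Noise values are illustrative.
    Returns an array of (heading, bias) estimates, one row per sample.
    """
    theta, bias = 0.0, 0.0
    P = np.eye(2)                              # error-state covariance
    F = np.array([[1.0, dt], [0.0, 1.0]])      # error propagation
    Q = np.diag([q_angle, q_bias])             # process noise (assumed)
    H = np.array([[1.0, 0.0]])                 # compass observes heading error
    history = []
    for w, z in zip(gyro, compass):
        theta += (w - bias) * dt               # inertial integration
        P = F @ P @ F.T + Q                    # covariance prediction
        y = theta - z                          # observed heading error
        S = H @ P @ H.T + r
        K = (P @ H.T) / S                      # Kalman gain, shape (2, 1)
        dx = (K * y).ravel()
        theta -= dx[0]                         # feed back heading correction
        bias += dx[1]                          # feed back bias correction
        P = (np.eye(2) - K @ H) @ P
        history.append((theta, bias))
    return np.array(history)
```

With a constant true bias, the bias estimate converges and the corrected heading follows the absolute reference, which is the role the compass plays in correcting gyro drift.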
Chapter 5 details all the necessary steps for implementing a vision positioning system. The pose tracking system for outdoor augmented reality is partly based on a vision system that tracks the head position within centimeters and the head orientation within degrees, and that updates within a second. The algorithms needed to obtain a robust vision system for tracking the automotion of a camera based on its observations of the physical world include feature detection algorithms, camera calibration routines and pose determination algorithms.
In Chapter 6 we summarize the presented work with concluding remarks. Here, we also present ideas and possibilities for future research.
The conclusion is that, since no single existing technology or sensor can solve the pose problem on its own, we combine information from multiple sensors to obtain a more accurate and stable system. We present the development of an entire position determination system using existing off-the-shelf sensors integrated using separate Kalman filters. A unified solution is presented: inertial measurement integration for orientation and GPS in combination with a differential correction unit for positioning.
|
[PDF]
[Abstract]
|
| 8 |
|
Development of a computerized handbook of architectural plans
The dissertation investigates an approach to the development of visual / spatial computer representations for architectural purposes through the development of the computerized handbook of architectural plans (chap), a knowledge-based computer system capable of recognizing the metric properties of architectural plans. This investigation can be summarized as an introduction of computer vision to the computerization of architectural representations: chap represents an attempt to automate recognition of the most essential among conventional architectural drawings, floor plans. The system accepts as input digitized images of architectural plans and recognizes their spatial primitives (locations) and their spatial articulation on a variety of abstraction levels. The final output of chap is a description of the plan in terms of the grouping formations detected in its spatial articulation. The overall structure of the description is based on an analysis of its conformity to the formal rules of its stylistic context (which in the initial version of chap is classical architecture).
Chapter 1 suggests that the poor performance of computerized architectural drawing and design systems is, among other things, evidence of the need to computerize visual / spatial architectural representations. A recognition system such as chap offers a comprehensive means of investigating a methodology for the development and use of such representations.
Chapter 2 describes a fundamental task of chap: recognition of the position and shape of locations, the atomic parts of the description of an architectural plan in chap. This operation represents the final and most significant part of the first stage in processing an image input in a machine environment.
Chapter 3 moves to the next significant problem, recognition of the spatial arrangement of locations in an architectural plan, that is, recognition of grouping relationships that determine the subdivision of a plan into parts. In the absence of systematic and exhaustive typological studies of classical architecture that would allow us to define a repertory of the location group types possible in classical architectural plans, Chapter 3 follows a bottom-up approach based on grouping relationships derived from elementary architectural knowledge and formalized with assistance from Gestalt theory and its antecedents. The grouping process described in Chapter 3 corresponds both in purpose and in structure to the derivation of a description of an image in computer vision [Marr 1982].
Chapter 4 investigates the well-formedness of the description of a classical architectural plan in an analytical manner: each relevant level (or sublevel) of the classical canon according to Tzonis & Lefaivre [1986] is transformed into a single group of criteria of well-formedness which is investigated independently. The hierarchical structure of the classical canon determines the coordination of these criteria into a sequence of cognitive filters which progressively analyses the correspondence of the descriptions derived as in Chapter 3 to the constraints of the canon.
The methodology and techniques presented in the dissertation are primarily considered with respect to chap, a specific recognition system. The resulting specification of chap gives a measure of the use of such a system within the context of a computerized collection of architectural precedents and also presents several extensions to other areas of architecture. Although these extensions are not considered as verifiable claims, Chapter 5 describes some of their implications, including on the role of architectural drawing in computerized design systems, on architectural typologies, and on the nature and structure of generative systems in architecture.
|
[PDF]
[Abstract]
|
| 9 |
|
Integration of 3D tracking systems for Interaction in Spatial Augmented Reality
This thesis presents a projector-based Spatial Augmented Reality (SAR) system designed and developed to support physical and virtual 3D Rapid Prototyping in the field of Industrial Design Engineering. The main contribution is a 3D scanner for obtaining a virtual model of a physical model, together with tools that support the interactive design of features on the object’s surface.
More specifically, this work presents an approach to setting up the hardware to support SAR, also known as hardware calibration for SAR. Each hardware entity is calibrated with respect to a common “world” in order to achieve effective communication. This world is set to coincide with the graphics world, which allows us to imagine being inside the 3D graphics world while the virtual content is rendered onto the scene’s objects. To identify the basic ingredients that enable interaction in our SAR system, we take into account the limitations of the Rapid Prototyping process, background knowledge of SAR systems, and related work. We then designed the interaction components of the system according to the characteristics of our setting. The SAR system was designed to operate in the peripersonal region. In this region, the user provides input via a purpose-built IR-tracked pen, and a dynamic menu serves as the interface to the system. Functionality such as selection, feedback, and annotation is enabled for interacting with the SAR system. The system’s application is divided into two parts. The first part uses the RGB-D camera and the IR tracker to construct a 3D scanner, which produces a virtual model of an object through sampling, segmentation, and registration of sequential point clouds. In the second part, the result of the scanning process, a polygonal mesh of the scanned object, is added to the SAR system’s application, which enables interaction with virtual models and the ground level of the world. Together, these two parts aim to support industrial designers in scanning a freshly made physical prototype and to enable the design of features on the corresponding virtual model using the SAR system during the rapid prototyping process.
In order to identify the strong points and weaknesses of the current state of the SAR system application, we carried out a user evaluation. Twenty-one students from the Faculty of Industrial Design Engineering evaluated the two parts of the system’s application. The results show that the SAR system is useful and that it has great potential in the field of Industrial Design Engineering. Nevertheless, there is still room for improvement and future work before it can be fully applied in the field.
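Calibrating every device against one shared world frame amounts to knowing, for each device, a rigid transform from its own coordinates into world coordinates. The sketch below shows that idea with 4x4 homogeneous matrices; the function names and the `world_from_device` naming are assumptions for this illustration, and how the transforms are actually estimated is not covered here.

```python
import numpy as np


def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def to_world(world_from_device, point_device):
    """Express a 3D point measured in a device frame in the shared world frame."""
    p = np.append(np.asarray(point_device, dtype=float), 1.0)
    return (world_from_device @ p)[:3]
```

Once a pen tip measured by a tracker and a surface point measured by a depth camera are both expressed this way, they can be compared and rendered in the same graphics world.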
|
[PDF]
[Abstract]
|
| 10 |
|
Automation in Architectural Photogrammetry: Line-Photogrammetry for the Reconstruction from Single and Multiple Images
Architectural photogrammetry has been practised for more than a century for the documentation of cultural heritage. Nowadays, the emphasis is on the construction of computer models for virtual reality applications. Since the introduction of the computer, and later the digital camera, research in photogrammetry has aimed at automation. This thesis reports on research on automation in architectural photogrammetry for efficient reconstruction of detailed building models from one or more, possibly widely separated, digital close-range images. This research lies on the fringes of photogrammetry and computer vision. It treats topics frequently studied in computer vision in a photogrammetric way and offers new solutions. Examples cover interior orientation and reconstruction from a single image, vanishing point detection, and the wide-baseline stereo problem. A semi-automatic approach is chosen that exploits knowledge of the object shape, such as planarity of facades, rectangular and repeating structures in the building, and shape symmetries. Automatically or manually extracted straight image line features are the main observations in the line-photogrammetric approaches presented in this thesis. Furthermore, the methods developed are characterised by the use of robust direct solutions for approximate value computation, followed by least-squares adjustment in which the knowledge of the shape of the building is processed together with the image line observations. This integral adjustment provides optimal estimates for the object model parameters and facilitates quality assessment.
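As a small illustration of working with straight image lines, the sketch below estimates a vanishing point as the least-squares intersection of several line segments in homogeneous coordinates. This is a generic textbook construction given here for orientation only; it is not the robust direct solution or the adjustment developed in the thesis, and the function names are chosen for the example.

```python
import numpy as np


def line_through(p, q):
    """Homogeneous line through two distinct image points (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])


def vanishing_point(segments):
    """Least-squares intersection of image line segments.

    segments: iterable of ((x1, y1), (x2, y2)) with distinct endpoints.
    Returns a homogeneous 3-vector v minimising the sum of (l_i . v)^2
    subject to |v| = 1; a third component near zero means the image
    lines are (nearly) parallel.
    """
    lines = np.array([line_through(p, q) for p, q in segments])
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(lines)
    return vt[-1]
```

Dividing the result by its third component gives pixel coordinates whenever the vanishing point is finite.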
|
[PDF]
[Abstract]
|
| 11 |
|
Multi-Scale Pattern Recognition for Image Classification and Segmentation
Scale is an important parameter of images. Different objects or image structures (e.g. edges and corners) can appear at different scales and each is meaningful only over a limited range of scales. Multi-scale analysis has been widely used in image processing and computer vision, serving as the basis for many high-level image analysis systems. One such high-level system is based on supervised learning as studied in pattern recognition and machine learning, which might take the results from multi-scale analysis as its input. Supervised learning defines a classifier to assign objects into different categories, and learns the classifier with some example objects whose category labels are known.
A common characteristic of the current multi-scale analysis methods, however, is that they are designed without specific assumptions about the high-level image analysis systems. The problem is that different tasks need images to be analysed at different scales; that is, they need different multi-scale analyses. For example, for the same image containing a person, small scales are needed if the problem is to segment the eyes, while large scales are needed when one wants to segment the person. In many applications, the task is defined only by some given example images, and the right scale for the analysis is not known a priori. This calls for multi-scale analysis frameworks that can adapt to different tasks.
The aim of this thesis is to study such adaptive multi-scale frameworks based on supervised learning. It focuses on three important aspects in multi-scale analysis: scale selection, scale invariance, and scale combining. Scale selection addresses the problem of choosing a right scale to detect an object or to analyse an image. Scale invariance is the ability to deal with objects appearing at arbitrary sizes. Scale combining concerns the combination of information from all scales. General learning frameworks are proposed for these three aspects. Examples are shown for image segmentation and classification problems.
A learning-based scale selection method is proposed for supervised image segmentation. Supervised segmentation trains a classifier based on some given segmented images, which assigns the pixels of an image into different classes or segments. The input of the classifier is features extracted from a neighbourhood at each pixel, and the scale of this neighbourhood is a crucial parameter of the features. Scale is usually selected as the size of a certain image structure, which is, however, not necessarily the best for the segmentation task. Keeping this in mind, the selected scale for supervised segmentation is redefined as the one at which pixels from different classes are best separable. A general scale selection scheme is proposed, which relies on the classifier for segmentation to measure the class separability. Experiments are presented, which show that this scheme can indeed choose scales that are best for the segmentation problem and thus leads to significantly improved performance.
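The idea of choosing the scale at which classes are best separable can be sketched as follows. Note the substitution: the abstract says the scheme relies on the segmentation classifier itself to measure separability, whereas this example uses the two-class Fisher criterion on a single feature as a simple stand-in. The function names and the input layout are assumptions for illustration.

```python
import numpy as np


def fisher_separability(features, labels):
    """Two-class Fisher criterion for a 1-D feature (stand-in separability measure)."""
    a, b = features[labels == 0], features[labels == 1]
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-12)


def select_scale(features_by_scale, labels):
    """Pick the scale whose per-pixel features separate the two classes best.

    features_by_scale: {scale: 1-D array of per-pixel features at that scale}
    labels:            1-D array of 0/1 class labels for the same pixels
    """
    return max(
        features_by_scale,
        key=lambda s: fisher_separability(features_by_scale[s], labels),
    )
```

The selected scale is therefore defined by the segmentation task rather than by the size of any particular image structure.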
Based on the proposed scale selection scheme, a scale-invariant classification framework is proposed for supervised image segmentation. This classifier can deal with images at arbitrary scales. Consequently, the same segmentation result will be obtained when an image is resized. The classifier is trained with image features from all scales and is thus able to handle images at any scale. To keep the classifier from being biased toward particular scales, the right proportion of features from different scales is needed. Scale invariance of the classification is achieved with the proposed scale selection scheme in the testing phase, which finds the right scales for image structures of different sizes.
A learning model closely related to the proposed scale-invariant classification is multiple-instance learning (MIL). MIL is a generalised supervised-learning framework that represents an object as a bag consisting of many feature vectors called instances. Only some of the instances in the bag are informative about the label of the object, while others share the same probability distribution for objects from different classes. In the training phase, only the labels of bags (not instances) are known, and a classifier is trained to separate bags into different classes. These characteristics make MIL a good fit for multi-scale image analysis, as an object can be represented with a set of features from all scales and only features from some scales are informative. Features from other scales are uninformative as the object becomes too blurred or too small to be distinguished from other objects. Observing that MIL algorithms usually make effective use of only one, not all, informative instance in a bag, we propose a new MIL model. A simple MIL classifier is obtained, which performs very well on numerous data sets in the experiments.
Combining information from multiple scales is studied based on the dissimilarity representation. It has been recognised that information from more than one scale can be useful for image analysis and should be exploited for better performance. For learning-based image analysis, multi-scale information is usually combined by concatenating features from all scales, which typically creates an enormously high-dimensional feature vector and thus makes learning difficult. We use the dissimilarity representation because it makes it possible to combine multi-scale information without increasing the dimensionality of the representation space. It represents an image by its dissimilarities to a set of reference images. Multi-scale information is exploited by computing dissimilarities at each scale and then combining these dissimilarities. Various rules are proposed and tested with real-world image classification problems. The results show that simple combining rules can already improve significantly upon the best result from the individual scales, and more adaptive rules, which exploit certain structures along the scale, can lead to even better results.
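The dimensionality argument can be seen in a short sketch: however many scales are used, each image ends up represented by one dissimilarity per reference image. Euclidean distance and averaging across scales are assumptions chosen for this example; the abstract does not specify which dissimilarity measure or which combining rules were used.

```python
import numpy as np


def dissimilarity_representation(images_by_scale, refs_by_scale, combine=np.mean):
    """Combine per-scale dissimilarities into one representation.

    images_by_scale: list over scales of arrays (n_images, n_features_at_scale)
    refs_by_scale:   list over scales of arrays (n_refs, n_features_at_scale)
    combine:         reduction over the scale axis (np.mean, np.max, ...)
    Returns an array of shape (n_images, n_refs), independent of the
    number of scales and of the feature dimensionality at each scale.
    """
    per_scale = []
    for x, r in zip(images_by_scale, refs_by_scale):
        diff = x[:, None, :] - r[None, :, :]
        per_scale.append(np.linalg.norm(diff, axis=2))   # Euclidean (assumed)
    return combine(np.stack(per_scale), axis=0)
```

Concatenating the raw features would instead grow the representation with every added scale, which is the difficulty the dissimilarity representation avoids.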
|
[PDF]
[Abstract]
|