| 1 |
|
Specification and Implementation of a DMA Controller in an Embedded System
The rapid growth of in-car entertainment applications poses an increasing challenge for both data computation and data communication, which are managed by the microprocessor. This thesis project is the third stage of an ongoing Direct Memory Access (DMA) Controller project at NXP Semiconductors, whose purpose is to specify and implement a DMA Controller that takes over data communication tasks from the microprocessor.
In the first step of the thesis, a test principle was investigated to fully test the existing results, but the simulation results of the Core Unit did not satisfy the requirements. The Core Unit of the DMA Controller is responsible for the sequential-single transfer and the burst transfer involving wait states. The existing specification and implementation were analyzed, and a number of possible improvements were identified. During the second step, the Core Unit was re-specified according to these improvements and fully implemented in VHDL to fulfill the requirements. After the Core Unit design, the functions of the Linked List transfer were specified using the Hatley and Pirbhai methodology. The Linked List Unit, which manages the Linked List transfer, was specified to support both Static and Dynamic Linked List transfers. This specification provides an essential basis for the future implementation.
The implementation of the Core Unit was tested with Simvision following the proposed test principle. The results satisfied the functional requirements, showing that the specification is feasible. Additionally, the Core Unit was synthesized using Cadence Ambit. The equivalent gate count of a Core Unit Cell is 3k, which is smaller than that of the DMA Controller currently used at NXP Semiconductors.
|
[PDF]
[Abstract]
|
| 2 |
|
Desynchronization Methods for Scheduled Circuits
Synchronous systems waste a lot of power in the clock tree, and must be designed for the worst-case scenario in terms of speed. Asynchronous circuits offer relief from these problems by replacing clock signals with handshakes, which only charge when data is being transferred, and with delay signals, which may adapt more easily to variance in speed than the clock period. Desynchronization is the process of turning a synchronous circuit into an asynchronous one. Scheduled circuits are a common way to provide a good compromise between conserving the area of a circuit and increasing its speed. Desynchronizing such a system is difficult because every functional unit in the circuit must respond to controls from the central state machine, which cannot easily handshake with all of them. This report demonstrates two related methods designed specifically for the conversion of a synchronous, scheduled circuit into an asynchronous, delay-insensitive circuit. Decomposition of the central state machine into smaller, local ones is used to combat the problem of skew in the control signals, as well as to speed up the asynchronous circuit. The slack in the clock period can also be used for possible speedup. Conditions which threaten deadlock of the circuit are identified and rescheduling solutions are proposed. The tradeoff between the two methods, area conservation and hardware reusability versus speed, is also explained.
|
[PDF]
[Abstract]
|
| 3 |
|
Optimization of the Belief Propagation algorithm for Luby Transform decoding over the Binary Erasure Channel
Live-streaming media applications over the Internet are characterized by time deadlines and bandwidth constraints. Reliability over the Internet has been provided traditionally by the Transmission Control Protocol (TCP) based on retransmissions. However, resending the missed information leads to a waste in time and bandwidth. Erasure correcting codes can be used as an alternative to TCP.
In this thesis, we consider the use of Luby Transform (LT) codes, which belong to the family of Digital Fountain (DF) codes. LT codes have low encoding and decoding times compared with other erasure codes such as Reed-Solomon (RS) and Low-Density Parity-Check (LDPC) codes.
They are also the first realization of rateless codes, where the number of encoded symbols is potentially limitless. Therefore, they are suitable for Internet applications, where the channel conditions can change very fast or be unknown. The accepted efficient decoding algorithm for LT codes is the Belief Propagation (BP) algorithm. Unfortunately, it performs rather poorly when the number of message symbols is small. This turns out to be a limitation in live-streaming applications, as they must wait until that number of source symbols has been received before attempting decoding.
In our project, we explore optimisations of the BP decoding process for LT codes when the number of information symbols is small. We present two new decoding algorithms that improve the performance of BP while keeping a low complexity. We show simulation results for the success rate and complexity of the new LT decoding algorithms versus overhead when used with small message sizes, demonstrating the gain in performance compared with BP.
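For orientation, BP decoding of an LT code over the binary erasure channel reduces to a peeling process: find an encoded symbol with exactly one unresolved source neighbour, recover that source symbol, XOR it out of the remaining encoded symbols, and repeat. The sketch below illustrates only this baseline decoder, not the two new algorithms of the thesis; the function name `lt_peel_decode` and the toy data are assumptions made for illustration.

```python
def lt_peel_decode(k, encoded):
    """Baseline BP (peeling) decoder for an LT code over the BEC.

    k       -- number of source symbols
    encoded -- list of (neighbour_indices, value) pairs, where value is the
               XOR of the source symbols listed in neighbour_indices
    Returns a list of length k; entries stay None if decoding stalls.
    """
    source = [None] * k
    # Work on mutable copies so the caller's data is left untouched.
    pending = [[set(nbrs), val] for nbrs, val in encoded]

    progress = True
    while progress:
        progress = False
        for item in pending:
            nbrs = item[0]
            # XOR out every neighbour that has already been recovered.
            for i in [i for i in nbrs if source[i] is not None]:
                item[1] ^= source[i]
                nbrs.discard(i)
            # A degree-one symbol reveals its single remaining neighbour.
            if len(nbrs) == 1:
                i = nbrs.pop()
                source[i] = item[1]
                progress = True
    return source


# Toy example (hypothetical data): three source symbols 5, 9, 12.
received = [
    ((0,), 5),          # degree 1: source symbol 0 itself
    ((0, 1), 5 ^ 9),    # degree 2
    ((1, 2), 9 ^ 12),   # degree 2
]
decoded = lt_peel_decode(3, received)

# With no degree-one symbol available, the peeling process cannot start.
stalled = lt_peel_decode(2, [((0, 1), 3)])
```

The stalled case shows why small message sizes are problematic: when no degree-one symbol is left, plain BP stops even though more symbols might be recoverable by other means.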
|
[PDF]
[Abstract]
|
| 4 |
|
System-level Fault-Tolerance Analysis of Small Satellite On-Board Computers
Commercial Off-The-Shelf (COTS) electronic components offer cost-effective solutions for the development of On-Board Computers (OBCs) in the small satellite industry. However, COTS parts are not originally designed to withstand the space radiation environment. Traditional fault-tolerance practices rely on expensive radiation tests or are based on circuit-level knowledge which is not easily available. This work proposes a novel simulation-based statistical approach to assist satellite designers in performing OBC fault-tolerance analysis.
The presented novel approach is based on high-level system modeling and an object-oriented fault injection mechanism. Such a technique allows the comparison between fault-tolerance techniques and reveals the consequences of radiation effects in the COTS parts at early development stages.
The work covers the implementation of the proposed simulation framework, which includes the OBC and fault modeling. The fault models are based on the conducted radiation environment analysis. A range of software and hardware fault detection and mitigation techniques is investigated as case studies. They include time and hardware Triple-Modular Redundancy, FPGA-based memory scrubbing with Hamming encoding, and watchdog/co-processor monitoring. The case studies reveal that the proposed approach can be used to choose suitable fault-tolerance techniques, increase their efficiency, and reduce the required hardware resources.
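As a small illustration of one of the listed techniques, Triple-Modular Redundancy masks a fault in one of three redundant copies by majority voting. The sketch below shows a generic bitwise voter only; it is not taken from the thesis's simulation framework, and the name `tmr_vote` and the injected bit flip are assumptions.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


golden = 0b10110010
# Hypothetical single-event upset: one bit flipped in one of the copies.
upset = golden ^ 0b00001000
voted = tmr_vote(golden, upset, golden)
```

Each output bit takes the value held by at least two of the three inputs, so a single corrupted copy is outvoted.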
Three papers are included:
- SystemC-based On-board Computer Modeling for Design Fault-Tolerance Assessment
- A Simulator of On-Board Computers for Evaluating Fault-Mitigation Techniques
- System Fault-tolerance Analysis of Small Satellite On-board Computers
|
[PDF]
[Abstract]
|
| 5 |
|
Time Synchronization in Wireless Sensor Networks
Accurate time synchronization is crucial for many applications of Wireless Sensor Networks. Extensive research has been performed on this topic; however, some aspects of time synchronization still require attention.
In this thesis we investigate aspects of the network which influence the accuracy of packet-based time synchronization in Wireless Sensor Networks. A network of four nodes capable of wireless communication is constructed. Each node consists of an Arduino Mega board with an ATMEGA2560 microcontroller and an RZ502 Accessory Kit with an AT86RF230 radio chip. Three time-stamp exchange models are implemented on this network to synchronize the clocks and are compared with each other. The synchronization parameters are estimated using Least Squares estimators.
Phase offset, frequency offset and communication delay must be estimated during synchronization. Of the basic time-stamp exchange schemes, only the sender-receiver model enables the estimation of the communication delay. Global synchronization is more accurate than pair-wise synchronization. Based on the constructed setup, the aspects which influence the accuracy of time synchronization are identified: clock stability, clock resolution, time-stamping moment, physical layer property, communication aspects, synchronization time, number of synchronization messages, synchronization interval, time-stamp exchange model and synchronization algorithm. The requirements on these aspects are derived from the required synchronization accuracy.
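To make the estimation step concrete, the following sketch fits a linear clock model to pairs of time-stamps with ordinary least squares. It assumes the model t_local = skew * t_ref + offset and neglects the communication delay, so it does not reproduce any of the three exchange models of the thesis; the function name `fit_clock` and the sample values are illustrative.

```python
def fit_clock(t_ref, t_local):
    """Least-squares fit of t_local = skew * t_ref + offset.

    skew   -- relative clock frequency (1.0 means no frequency offset)
    offset -- phase offset at t_ref = 0
    """
    n = len(t_ref)
    mean_x = sum(t_ref) / n
    mean_y = sum(t_local) / n
    sxx = sum((x - mean_x) ** 2 for x in t_ref)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(t_ref, t_local))
    skew = sxy / sxx
    offset = mean_y - skew * mean_x
    return skew, offset


# Hypothetical time-stamps: the local clock runs 100 ppm fast and starts 0.5 s ahead.
t_ref = [0.0, 1.0, 2.0, 3.0, 4.0]
t_local = [0.5 + 1.0001 * t for t in t_ref]
skew, offset = fit_clock(t_ref, t_local)
```

More time-stamp pairs reduce the influence of time-stamping noise on both estimates, which is one reason the number of synchronization messages appears among the aspects listed above.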
|
[PDF]
[Abstract]
|
| 6 |
|
High-Quality, Real-Time HD Video Stereo Matching on FPGA
Stereo matching is an important computer vision technique, which extracts the depth information of the scene by matching a pair of stereo images. It has numerous applications, such as view-point interpolation, 3DTV, object detection, etc. In the past decades, many algorithms have been proposed to improve the matching quality or to increase the speed. Due to the high computational complexity, it is still quite challenging to attain high matching-quality at real-time speed.
In this thesis, we propose a hardware design for stereo matching which is capable of producing high-quality disparity maps at real-time speed. A high-quality stereo matching algorithm is implemented efficiently and optimized for hardware, attaining a large speedup through parallel computing. The whole algorithm is implemented in a single EP3SL150 FPGA. The experimental results show that our design is capable of matching high-definition videos at real-time speed, i.e. 60 frames per second at 1024×768 resolution. In terms of matching quality, our design is among the leading real-time methods, as evaluated on the Middlebury stereo benchmark. As an application of the stereo matcher, we also build a depth-scaling system for 3DTV, working together with a view synthesis module. The SoC system synthesizes high-quality virtual views at real-time speed.
|
[PDF]
[Abstract]
|
| 7 |
|
MEP-MAS: A message passing multiprocessor array for streaming applications
This thesis presents the design and implementation of a Chip-Multiprocessor (CMP) targeted at streaming applications (e.g. MPEG, MP3). Streaming applications are applications which can be split into several distinct stages working on data elements in a pipelined fashion. We propose a distributed-memory array (MEP-MAS), where the cores communicate via message-passing, optimizing the throughput. Application tasks are dynamically scheduled by a hardware scheduler taking the consumer-producer locality into account, thereby minimizing the communication overhead. The array is evaluated in terms of performance, scalability and predictability as a function of varied input stream sizes, multiple pipelines, number of pipeline stages and traffic volume. The array is configured as a 4 by 5 mesh and has reached speedups as high as 3.6x for a 4-stage pipeline and 13.4x for a 16-stage pipeline. Our experiments have highlighted the need for a balanced workload in order to optimize the performance. Furthermore, it is shown that MEP-MAS is scalable, as the speedup and throughput increase almost linearly with the number of added pipelines. The speedup has increased from 3.6x to 13.5x and the throughput from 17k data elements per second to 65k data elements per second. Increasing the traffic volume in the network marginally affects the speedup (-1.9%). Finally, increasing the traffic volume can cause a high deviation in arrival times between two subsequent data blocks in the pipeline of up to 8%.
|
[PDF]
[Abstract]
|
| 8 |
|
MEP-MAS: A Message Passing Multiprocessor Array for Streaming Applications
This thesis presents the design and implementation of a Chip-Multiprocessor (CMP) targeted at streaming applications (e.g. MPEG, MP3). Streaming applications are applications which can be split into several distinct stages working on data elements in a pipelined fashion. We propose a distributed-memory array (MEP-MAS), where the cores communicate via message-passing, optimizing the throughput. Application tasks are dynamically scheduled by a hardware scheduler taking the consumer-producer locality into account, thereby minimizing the communication overhead. The array is evaluated in terms of performance, scalability and predictability as a function of varied input stream sizes, multiple pipelines, number of pipeline stages and traffic volume. The array is configured as a 4 by 5 mesh and has reached speedups as high as 3.6x for a 4-stage pipeline and 13.4x for a 16-stage pipeline. Our experiments have highlighted the need for a balanced workload in order to optimize the performance. Furthermore, it is shown that MEP-MAS is scalable, as the speedup and throughput increase almost linearly with the number of added pipelines. The speedup has increased from 3.6x to 13.5x and the throughput from 17k data elements per second to 65k data elements per second. Increasing the traffic volume in the network marginally affects the speedup (-1.9%). Finally, increasing the traffic volume can cause a high deviation in arrival times between two subsequent data blocks in the pipeline of up to 8%.
|
[PDF]
[Abstract]
|
| 9 |
|
Design Partitioning for Custom Hardware Emulation
Hardware verification is a very important step of system design. Various techniques are used for this purpose, one of which is hardware emulation. Hardware emulation is a very efficient and flexible technique with high speed in comparison to other approaches. Emulation using programmable hardware can provide a very fast and feature-rich debugging environment for system verification. However, the size and complexity of today's Integrated Circuit designs may exceed the capacity of the programmable devices onto which the emulator maps the design under test. Therefore, in order to create a prototype of the emulator and the design under test, we need to find a way to partition the whole design across the several programmable devices of the emulator. This thesis addresses the problem of design partitioning for a custom emulator using the flattened netlist of the design and implementing a variation of the Fiduccia–Mattheyses graph partitioning algorithm. The tool that we have developed extends the Fiduccia–Mattheyses algorithm to fit the various constraints of a custom emulator while retaining the algorithm's linear runtime. We extensively test the various parameters of the algorithm and their impact on the performance of the tool, and report the behavior and the improvement in the number of cut nets starting from an arbitrary partition and from a manually clustered partition. In both cases the improvement is more than 50% over the initial cut.
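The quantity that drives a Fiduccia–Mattheyses pass is the gain of a cell: the reduction in the number of cut nets if that cell alone is moved to the other block. The sketch below shows this gain computation for a two-way partition of a small netlist. It covers only the basic algorithm, not the constraints added by the thesis's tool, and it omits the bucket structure that gives the algorithm its linear runtime; the names `cut_size` and `fm_gain` and the example netlist are assumptions.

```python
def cut_size(nets, part):
    """Number of nets with cells in both blocks of a two-way partition."""
    return sum(1 for net in nets if len({part[c] for c in net}) > 1)


def fm_gain(cell, nets, part):
    """Reduction in cut size if `cell` alone moves to the other block."""
    src = part[cell]
    gain = 0
    for net in nets:
        if cell not in net:
            continue
        same_side = sum(1 for c in net if part[c] == src)
        other_side = len(net) - same_side
        if same_side == 1:
            gain += 1   # cell is alone on its side: the move uncuts this net
        if other_side == 0:
            gain -= 1   # net lies entirely on the source side: the move cuts it
    return gain


# Hypothetical netlist: each net is a tuple of distinct cell names.
nets = [("a", "b"), ("a", "c"), ("b", "c", "d")]
part = {"a": 0, "b": 1, "c": 1, "d": 0}

before = cut_size(nets, part)
gain_a = fm_gain("a", nets, part)

# Apply the move and confirm that the cut shrinks by exactly the gain.
moved = dict(part, a=1)
after = cut_size(nets, moved)
```

A full pass would repeatedly move the unlocked cell of highest gain subject to a balance constraint, lock it, update the gains of its neighbours, and finally keep the best prefix of moves.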
|
[PDF]
[Abstract]
|
| 10 |
|
Design and analysis of a coherent memory sub-system for FPGA-based embedded systems
Cache coherence and memory consistency are among the most decisive and challenging issues in the design of shared-memory multi-core systems, influencing both the correctness and the performance of parallel programs. In this thesis, we identify and analyze the problem of designing a coherent/consistent memory subsystem in general and then focus on FPGA-based multi-core embedded systems containing general-purpose CPUs and dedicated hardware accelerators. We narrow down the problem by targeting only stream-based applications and developing dedicated application-specific solutions. A flexible Windowed-FIFO communication pattern is proposed for use by the parallel programs running on the multi-core system. The software APIs for the FPGA platform are implemented and tested, and a customized streaming cache memory is designed, implemented and tested based on the proposed communication pattern. Finally, example embedded systems are developed and tested on the FPGA platform to demonstrate the correct functionality of the APIs, the cache memory and the coherent data communication between the cores. All the tests are done on a Xilinx Spartan3dsp development board, and all the hardware and software aspects of the FPGA platform are studied and their influence on the memory system is analyzed. The simulations and analyses show that the developed solution is less complex and more scalable and portable than existing solutions, while providing a flexible range of functionality from which different streaming parallel applications can benefit.
|
[PDF]
[Abstract]
|