March 2017 Newsletter - Special Section on Optical Interconnects
D.-W. Chang, I.-C. Lin, and L.-C. Yong
ROHOM: Requirement-Aware Online Hybrid On-Chip Memory Management for Multicore Systems
Many studies have shown that energy consumption of on-chip memory is a critical issue for multicore embedded systems. In order to reduce energy consumption, scratchpad memory (SPM), a software-controlled on-chip memory, has been increasingly used. In a typical multicore embedded system that uses SPM in the on-chip memory, each core has local SPM and can access data in both local SPM and SPMs of other cores (i.e., remote SPM). Since the latency and energy of accessing remote SPMs are higher than those of accessing local SPM, how data are allocated in local and remote SPMs influences the performance and energy consumption of the system. This paper proposes a requirement-aware online hybrid on-chip memory management (ROHOM) method. This method contains an SPM supervisor that determines the SPM allocation of each task according to the dynamic access behavior and SPM requirements of the task. In addition, two policies, 1) free remote SPM space first and 2) get local SPM space first, are proposed in ROHOM to reduce the access frequency of remote SPMs. The experimental results show that ROHOM can reduce the energy-delay product by up to 69% (42% on average) in an 8-core system and by up to 69% (50% on average) in a 16-core system when compared to a contention-aware SPM allocation method. The hardware area overhead is insignificant (about 1.8%).
Emerging Technologies and Applications
S. Bhattacharjee, S. Chatterjee, A. Banerjee, T.-Y. Ho, K. Chakrabarty, and B. B. Bhattacharya
Adaptation of Biochemical Protocols to Handle Technology-Change for Digital Microfluidics
Advances in digital microfluidic (DMF) technologies offer a promising platform for a variety of biochemical applications, ranging from massively parallel DNA analysis and computational drug discovery to toxicity monitoring and medical diagnosis. In this paper, we address the migration problem that arises when the technology undergoes a change in the context of DMFs. Given a biochemical reaction synthesized for actuation on a given DMF architecture, we discuss how the same biochemical reaction can be ported seamlessly to an enhanced architecture, with possible modifications to the architectural parameters (e.g., clock frequency, mixer size, and mixing time) or geometric changes (e.g., change in reservoir locations or mixer positions, inclusion of new sensors or other physical resources). Complete resynthesis of the protocol for the new architecture may often become either inefficient or even infeasible due to scalability, proprietary, security, or cost issues. We propose an adaptation method for handling such technology-changes by modifying the existing actuation sequence through an incremental procedure. The foundation of our method lies in symbolic encoding and satisfiability-solvers, enriched with pertinent graph-theoretic and geometric techniques. This enables us to generate functionally correct solutions for the new target architecture without necessitating a complete resynthesis step, thereby enabling the utilization of these chips by users in biology who are not familiar with the on-chip synthesis tool-flow. We highlight the benefits of the proposed approach through extensive simulations on assay benchmarks.
M. B. Hammouda, P. Coussy, and L. Lagadec
A Unified Design Flow to Automatically Generate On-Chip Monitors During High-Level Synthesis of Hardware Accelerators
Security and safety are increasingly important in embedded system design. A key issue hence lies in the ability of systems to respond safely when errors occur at runtime, to prevent unacceptable behaviors that can lead to failures or sensitive data leakage. In this paper, we propose a design approach that automatically generates on-chip monitors (OCMs) during high-level synthesis (HLS) of hardware accelerators (HWaccs). An OCM checks at runtime the input/output timing behavior, the control flow execution, and algorithmic properties (via ANSI C assertions) of the monitored HWacc. The OCM is implemented separately from the HWacc, and an original technique is introduced for their synchronization. Two synthesis options are proposed to trade off performance against area. Experimental results show that error detection on the control flow is 16× better compared to the existing approaches, while the cost of assertions is reduced by 17.48% on average. The impact on execution time (i.e., latency of the HWacc) is decreased by 2.76× at no area penalty and by up to 4.5× with less than 10% extra area. The clock period overhead is at worst less than 5%, and the overhead on the synthesis time of the HWacc to generate OCMs is 7.44% on average.
Modeling and Simulation
M. Kvassay, E. Zaitseva, V. Levashenko, and J. Kostolny
Reliability Analysis of Multiple-Outputs Logic Circuits Based on Structure Function Approach
Reliability is one of the principal characteristics in the design of many systems, and logic circuits are among them. These systems are very interesting objects from the reliability point of view because they have some characteristics that are not very common in classical approaches used in reliability engineering. First, their activity depends not only on the operability of their components (logic gates) but also on other influences, which are represented by the values of the circuit input signals. Second, logic circuits are typical instances of noncoherent systems. We propose a new method of computing circuit availability (unavailability) and several measures for investigating the topological properties of a logic circuit. These measures allow us to detect the components (logic gates) with the greatest influence on the circuit operability. The method is based on the methodology of Boolean differential calculus. All the approaches presented in this paper are illustrated using the example of reliability analysis of a one-bit full adder.
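As a rough illustration of the Boolean differential calculus mentioned above, the following Python sketch computes a direct Boolean derivative of the outputs of a one-bit full adder with respect to one input, i.e., the input assignments for which flipping that input changes the output. It is only an assumption-laden sketch: the function names are hypothetical, and it differentiates with respect to an input signal rather than reproducing the paper's structure function or its gate-level importance measures.

```python
from itertools import product


def full_adder(a, b, cin):
    """Return (sum, carry_out) of a one-bit full adder."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def boolean_derivative(f, index, n_inputs):
    """List the assignments of the other inputs for which flipping
    input `index` changes the value of f (hypothetical helper)."""
    sensitive = []
    for rest in product((0, 1), repeat=n_inputs - 1):
        x0 = rest[:index] + (0,) + rest[index:]
        x1 = rest[:index] + (1,) + rest[index:]
        if f(*x0) != f(*x1):
            sensitive.append(rest)
    return sensitive


def carry(a, b, cin):
    return full_adder(a, b, cin)[1]


def total(a, b, cin):
    return full_adder(a, b, cin)[0]


# Assignments of (b, cin) for which input a influences each output.
carry_sensitive = boolean_derivative(carry, 0, 3)
sum_sensitive = boolean_derivative(total, 0, 3)
print(carry_sensitive, sum_sensitive)
```

In this sketch the carry output depends on input a only when b and cin differ, while the sum output depends on a for every assignment of the other inputs.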
A. Ahmadinejad and H. Zarrabi-Zadeh
Finding Maximum Disjoint Set of Boundary Rectangles With Application to PCB Routing
Motivated by the bus escape routing problem in printed circuit boards (PCBs), we study the following optimization problem: given a set of rectangles attached to the boundary of a rectangular region, find a subset of nonoverlapping rectangles with maximum total weight. We present an efficient algorithm that solves this problem optimally in O(n^4) time, where n is the number of rectangles in the input instance. This improves over the best previous O(n^6)-time algorithm available for the problem. We also present two efficient approximation algorithms for the problem that find near-optimal solutions with guaranteed approximation factors. The first algorithm finds a 2-approximate solution in O(n^2) time, and the second one computes a 4/3-approximation in O(n^3) time. The experimental results demonstrate the efficiency of both our exact and approximation algorithms.
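To make the problem statement concrete, here is a small Python sketch that solves a toy instance by exhaustive search. It is not the paper's polynomial-time algorithm: it runs in exponential time and serves only as a reference for what "maximum-weight set of nonoverlapping rectangles" means. The rectangle encoding (x1, y1, x2, y2, weight), the function names, and the sample instance are all assumptions made for illustration.

```python
from itertools import combinations


def overlaps(r, s):
    """True if the open interiors of two axis-aligned rectangles intersect."""
    return r[0] < s[2] and s[0] < r[2] and r[1] < s[3] and s[1] < r[3]


def max_weight_disjoint(rects):
    """Exhaustive search over all subsets (exponential time, toy sizes only)."""
    best_weight, best_subset = 0, ()
    n = len(rects)
    for mask in range(1 << n):
        chosen = [i for i in range(n) if (mask >> i) & 1]
        if any(overlaps(rects[i], rects[j]) for i, j in combinations(chosen, 2)):
            continue  # at least one pair overlaps, so this subset is infeasible
        weight = sum(rects[i][4] for i in chosen)
        if weight > best_weight:
            best_weight, best_subset = weight, tuple(chosen)
    return best_weight, best_subset


# Hypothetical instance inside a 10 x 10 region; every rectangle touches
# the boundary (bottom edge y = 0 or left edge x = 0).
RECTS = [
    (0, 0, 4, 3, 5),   # bottom edge
    (3, 0, 7, 2, 4),   # bottom edge, overlaps rectangle 0
    (6, 0, 10, 5, 6),  # bottom and right edges, overlaps rectangle 1
    (0, 2, 2, 8, 3),   # left edge, overlaps rectangle 0
]

print(max_weight_disjoint(RECTS))
```

For this instance the best feasible choice is rectangles 0 and 2, with total weight 11.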
X. Liu, S. Sun, X. Li, H. Qian, and P. Zhou
Machine Learning for Noise Sensor Placement and Full-Chip Voltage Emergency Detection
Power supply fluctuation can be a potential threat to the correct operation of processors, in the form of a voltage emergency that happens when the supply voltage drops below a certain threshold. Noise sensors (with either analog or digital outputs) can be placed in the nonfunction area of processors to detect voltage emergencies by monitoring the runtime voltage fluctuations. Our work addresses two important problems related to building a sensor-based voltage emergency detection system: 1) offline sensor placement, i.e., where to place the noise sensors so that the number and locations of sensors are optimized in order to strike a balance between design cost and chip reliability and 2) online voltage emergency detection, i.e., how to use these placed sensors to detect voltage emergencies in the hotspot locations. In this paper, we propose integrated solutions to these two problems, respectively, for analog and digital (more specifically, binary) sensor outputs, by exploiting the voltage correlation between the sensor candidate locations and the hotspot locations. For the analog case, we use the Group Lasso and an ordinary least squares approach; for the binary case, we integrate the Group Lasso and the SVM approach. Experimental results show that our approach can achieve 2.3×–2.7× better voltage emergency detection results on average for analog outputs when compared to the state-of-the-art work; for the binary case, our methodology can achieve up to 21% improvement in prediction accuracy on average compared to an approach called max-probability-no-prediction.
C. Pilato, P. Mantovani, G. Di Guglielmo, and L. P. Carloni
System-Level Optimization of Accelerator Local Memory for Heterogeneous Systems-on-Chip
In modern system-on-chip architectures, specialized accelerators are increasingly used to improve performance and energy efficiency. The growing complexity of these systems requires the use of system-level design methodologies featuring high-level synthesis (HLS) for generating these components efficiently. Existing HLS tools, however, have limited support for the system-level optimization of memory elements, which typically occupy most of the accelerator area. We present a complete methodology for designing the private local memories (PLMs) of multiple accelerators. Based on the memory requirements of each accelerator, our methodology automatically determines an area-efficient architecture for the PLMs to guarantee performance and reduce the memory cost based on technology-related information. We implemented a prototype tool, called Mnemosyne, that embodies our methodology within a commercial HLS flow. We designed 13 complex accelerators for selected applications from two recently-released benchmark suites (Perfect and CortexSuite). With our approach, we are able to reduce the memory cost of single accelerators by up to 45%. Moreover, when reusing memory IPs across accelerators, we achieve area savings that range between 17% and 55% compared to the case where the PLMs are designed separately.
H. Li, X. Wang, J. Xu, Z. Wang, R. K. V. Maeda, Z. Wang, P. Yang, L. H. K. Duong, and Z. Wang
Energy-Efficient Power Delivery System Paradigms for Many-Core Processors
The design of the power delivery system plays a crucial role in guaranteeing the proper functionality of many-core processor systems. The power loss suffered on power delivery has become a salient part of total power consumption, and the energy efficiency of a highly dynamic system has been significantly challenged. Being able to achieve a fast response time and multiple voltage domain control, on-chip voltage regulators (VRs) have become popular choices to enable fine-grain power management, which also enlarges the design space of power delivery systems. This paper analytically studies different power delivery system paradigms and power management schemes in terms of energy efficiency, area overhead, and power pin occupation. The analysis shows that compared to the conventional paradigm with off-chip VRs, hybrid paradigms with both on-chip and off-chip VRs are able to maintain high efficiency over a larger range of workloads, though they suffer from low efficiency at light workload. When combined with the quantized power management scheme, the hybrid paradigm can improve the system energy efficiency at light workload by a maximum of 136% compared to the traditional load-balanced scheme. Besides this, the in-package (iP) hybrid paradigm further shows its advantage in reducing the physical overheads. The results reveal that at 120 W workload, it occupies only 10.94% of the total footprint area and 39.07% of the power pins of the off-chip paradigm. We conclude that the iP hybrid paradigm achieves the best tradeoffs between efficiency, physical overhead, and realization of fine-grain power management.
R. D. Blanton, F. Wang, C. Xue, P. K. Nag, Y. Xue, and X. Li
DFM Evaluation Using IC Diagnosis Data
Design for manufacturability rule evaluation using manufactured silicon (DREAMS) is a comprehensive methodology for evaluating the yield-preserving capabilities of a set of design for manufacturability (DFM) rules using the results of logic diagnosis performed on failed ICs. DREAMS is an improvement over prior art in that the distribution of rule violations over the diagnosis candidates and the entire design is taken into account along with the nature of the failure (e.g., bridge versus open) to appropriately weight the rules. Silicon and simulation results demonstrate the efficacy of the DREAMS methodology. Specifically, virtual data is used to demonstrate that the DFM rule most responsible for failure can be reliably identified even in light of the ambiguity inherent to a nonideal diagnostic resolution and a corresponding rule-violation distribution that is counterintuitive. We also show that physically aware diagnosis and the nature of the violated DFM rule can be used together to improve rule evaluation even further. Application of DREAMS to the diagnostic results from an in-production chip provides valuable insight into how specific DFM rules improve yield (or not) for a given design manufactured in a particular facility. Finally, we also demonstrate that a significant artifact of DREAMS is a dramatic improvement in diagnostic resolution. This means that in addition to identifying the most ineffective DFM rule(s), validation of that outcome via physical failure analysis of failed chips can be eased due to the corresponding improvement in diagnostic resolution.
P. Gonzalez-de-Aledo, N. Przigoda, R. Wille, R. Drechsler, and P. Sanchez
Towards a Verification Flow Across Abstraction Levels
The use of formal models to describe early versions of the structure and the behavior of a system has become common practice in the industry. UML and OCL are the de facto specification languages for these tasks. They allow for capturing system properties and module behavior in an abstract but still formal fashion. At the same time, this enables designers to detect errors or inconsistencies in the initial phases of the design flow, even if the implementation has not yet started. Corresponding tools for verification of formal models have become established in the recent past. However, verification results are usually not reused in later design steps. In fact, similar verification tasks are applied again, e.g., after the implementation has been completed. This is a waste of computational and human effort. In this paper, we address this problem by proposing a method which checks a given implementation of a system against its corresponding formal model. This allows for transferring verification results already obtained from the formal model to the implementation and, eventually, motivates a new design flow which addresses verification across abstraction levels. This paper describes the applied techniques as well as their orchestration. Afterward, the applicability of the proposed methodology is demonstrated by means of examples as well as a case study from an industrial context.
M. Fawaz and F. N. Najm
Fast Vectorless RLC Grid Verification
Checking the power distribution network of an integrated circuit must start early in the design process, when changes to the grid can be more easily implemented. Vectorless verification is a technique that achieves this goal by demanding limited information about the currents drawn from the grid. State-of-the-art techniques that deal with RLC grids become prohibitive even for medium-size grids. In this paper, we propose a novel technique that estimates the worst-case voltage fluctuations for RLC grids by carefully selecting the time step, in a way that significantly reduces the number of linear programs that need to be solved and eliminates the need for other expensive computations, like dense matrix-matrix multiplications. Results show that our technique is accurate and scalable for large grids, as it achieves over 19× speedup over existing methods.
LFSR-Based Generation of Multicycle Tests
This paper describes a procedure for computing a multicycle test set whose scan-in states are compressed into seeds for a linear-feedback shift register, and whose primary input vectors are held constant during the application of a multicycle test. The goal of computing multicycle tests is to provide test compaction that reduces both the test application time and the test data volume. To avoid sequential test generation, the procedure uses a single-cycle test set to guide the computation of multicycle tests. The procedure optimizes every multicycle test, and increases the number of faults it detects, by adjusting its seed, primary input vector, and the number of functional clock cycles. Optimizing the seed instead of the scan-in state avoids the computation of scan-in states for which seeds do not exist. Experimental results for benchmark circuits are presented to demonstrate the effectiveness of the procedure.
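The idea of compressing a scan-in state into an LFSR seed can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions, not the paper's procedure: it uses a hypothetical 4-bit Fibonacci LFSR with an arbitrarily chosen tap set, a hypothetical function name `lfsr_expand`, and no phase shifter or other decompression logic that a practical scheme might include.

```python
from itertools import product


def lfsr_expand(seed, taps, length):
    """Expand `seed` (a sequence of bits) into `length` output bits.

    Each step emits the last state bit, XORs the tapped state bits to
    form the feedback bit, and shifts the feedback in at the front.
    """
    state = list(seed)
    out = []
    for _ in range(length):
        out.append(state[-1])
        feedback = 0
        for t in taps:
            feedback ^= state[t]
        state = [feedback] + state[:-1]
    return out


TAPS = (2, 3)  # assumed tap positions for this toy 4-bit register

# Expand one seed into an 8-bit scan-in state.
scan_in = lfsr_expand([1, 0, 0, 0], TAPS, 8)

# Enumerate every 8-bit scan-in state reachable from some 4-bit seed.
reachable = {tuple(lfsr_expand(s, TAPS, 8)) for s in product((0, 1), repeat=4)}
print(scan_in, len(reachable))
```

With a 4-bit seed, only 16 of the 256 possible 8-bit scan-in states are reachable in this toy setup, which illustrates why searching over seeds directly avoids producing scan-in states that no seed can generate.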
E. E. Tsur
Computer Aided Design of a Microscale Digitally Controlled Hydraulic Resistor
Microscale mechanical networks are prevalent in lab-on-a-chip systems, which are rapidly expanding into biological, chemical, and physical research. In these systems, nanoliter volumes of fluids are manipulated, and precise control of flow in individual segments within a complex network is often desirable. One paradigm for such control suggests adjusting the hydraulic resistance of each segment, relying on the fact that, as in electrical circuits, fluid flow depends on the relation between the potential drop (pressure difference) and the resistance of the transmitting conductor. Current solutions for the control of hydraulic resistance rely on intricate fabrication processes, are often characterized by a high-biased error, and can generally produce only a limited range of resistance. Here, a computer-aided design of a six-bit digitally controlled adjustable hydraulic resistor, which features five linear ranges of resistance and a small footprint, is presented. This design can be rapidly embedded within a microfluidic network for real-time control of fluid flow.
C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou
DLAU: A Scalable Deep Learning Accelerator Unit on FPGA
As an emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. However, the size of the networks is becoming increasingly large due to the demands of practical applications, which poses a significant challenge to constructing high-performance implementations of deep learning neural networks. In order to improve performance while maintaining a low power cost, in this paper we design the deep learning accelerator unit (DLAU), a scalable accelerator architecture for large-scale deep learning networks that uses a field-programmable gate array (FPGA) as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve the throughput and utilizes tile techniques to explore locality for deep learning applications. Experimental results on a state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator is able to achieve up to 36.1× speedup compared to Intel Core2 processors, with a power consumption of 234 mW.
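The abstract does not spell out how DLAU tiles its computation, so the following Python sketch only illustrates the general tiling idea on a matrix multiplication, the core operation of a fully connected layer. The function names, the tile size, and the loop order are assumptions for illustration; the point is that each tile of the operands is reused for a block of outputs before the computation moves on, which is what gives tiling its locality benefit.

```python
def matmul_naive(a, b):
    """Reference product of two matrices stored as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile):
    """Same product, computed block by block with square tiles."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Accumulate the contribution of one tile pair into
                # the output block starting at (i0, j0).
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
B = [[1, 0], [0, 1], [2, 2]]
print(matmul_tiled(A, B, tile=2))
```

The tiled version produces the same result as the naive one for any positive tile size; only the order in which partial sums are accumulated changes.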