Fostering Design and Automation of
Electronic and Embedded Systems

February 2017 Newsletter - Special Section on Optical Interconnects

REGULAR PAPERS

Embedded Security

  • Y. Lao, B. Yuan, C. H. Kim, and K. K. Parhi

Reliable PUF-Based Local Authentication With Self-Correction
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7470631

Physical unclonable functions (PUFs) can extract chip-unique signatures from integrated circuits (ICs) by exploiting the uncontrollable randomness due to manufacturing process variations. These signatures can then be used for many hardware security applications, including authentication, anti-counterfeiting, IC metering, signature generation, and obfuscation. However, most of these applications require error-correcting methods to produce consistent PUF responses across different environmental conditions. This paper presents a novel method to enable lightweight, secure, and reliable PUF-based authentication. A two-level finite-state machine (FSM) is proposed to correct erroneous bits generated by environmental variations (e.g., temperature, voltage, and aging variations). In the proposed method, each PUF response is mapped to a key during the design phase. The actual key can be determined from the PUF response only after the chip is fabricated. Because the key is not known to the foundry, the proposed approach prevents counterfeiting. The performance of the proposed method and other applications are also discussed. Our experimental results show that the cost of the proposed self-correcting two-level FSM is significantly less than that of commonly used error-correcting codes: it consumes about 2× to 10× less area and about 20× to 100× less power than Bose-Chaudhuri-Hocquenghem (BCH) codes.
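
The core idea of tolerating environmentally induced bit flips can be illustrated with a minimal sketch (this is generic nearest-match correction, not the paper's two-level FSM; all names and the distance threshold are hypothetical):

```python
def hamming(a: int, b: int) -> int:
    """Number of differing response bits."""
    return bin(a ^ b).count("1")

def authenticate(response: int, enrolled: dict, t: int = 3):
    """Return the key of the enrolled response within distance t, else None."""
    for ref, key in enrolled.items():
        if hamming(response, ref) <= t:
            return key
    return None

enrolled = {0b10110100: "key-A", 0b01001011: "key-B"}
noisy = 0b10110100 ^ 0b00000101      # two bits flipped by, e.g., temperature
assert authenticate(noisy, enrolled) == "key-A"
assert authenticate(0b11111111, enrolled) is None
```

The attraction of the paper's FSM approach is that this correction happens without storing or exposing the enrolled responses in the clear.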

  • F. Sagstetter, M. Lukasiewycz, and S. Chakraborty

Generalized Asynchronous Time-Triggered Scheduling for FlexRay
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7471402

FlexRay is a hybrid communication protocol tailored to the requirements of the automotive domain, supporting both time-triggered and event-triggered communication at high data rates. The time-triggered static segment is commonly used for in-vehicle communication, while the event-triggered dynamic segment is used for diagnostics and configuration. This paper addresses the problem of synthesizing schedules for the static FlexRay segment under asynchronous scheduling, following the design approach of current automotive architectures. Previous approaches largely focused on FlexRay 2.1, while the few existing approaches for FlexRay 3.0 are suboptimal in several respects. As a remedy, our framework makes use of all new features of version 3.0 while remaining backward compatible with the still predominantly used FlexRay 2.1. The following approaches are proposed: 1) a single-stage integer linear programming (ILP) approach that determines an optimal solution but does not scale; 2) a multistage ILP that combines previously generated subsystem schedules into a global schedule; it clearly improves scalability but is not optimal, and it can integrate and convert existing FlexRay 2.1 schedules into a FlexRay 3.0 schedule, which is important for legacy reasons, i.e., to reduce testing and certification efforts; 3) a greedy heuristic that scales well and obtains high-quality solutions in comparison with the optimal solution, but is unsuitable for integrating existing schedules; and 4) metaheuristic approaches based on genetic algorithms and simulated annealing to evaluate the benefits of the proposed approaches.
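
The flavor of the static-segment scheduling problem can be conveyed with a tiny greedy sketch (not the paper's ILP or heuristic; the message set and parameters are illustrative): a message with repetition r transmitted in slot s and base cycle b occupies that slot in cycles b, b+r, b+2r, ..., and no two messages may occupy the same (slot, cycle) cell.

```python
def schedule(messages, num_slots, num_cycles=64):
    """messages: list of (name, repetition); returns {name: (slot, base)}."""
    occupied = set()                      # (slot, cycle) cells already taken
    placement = {}
    for name, rep in messages:
        for slot in range(num_slots):
            for base in range(rep):
                cells = {(slot, c) for c in range(base, num_cycles, rep)}
                if not cells & occupied:
                    occupied |= cells
                    placement[name] = (slot, base)
                    break
            if name in placement:
                break
        else:
            raise ValueError(f"no conflict-free slot for {name}")
    return placement

plan = schedule([("brake", 1), ("door", 4), ("diag", 4)], num_slots=2)
assert plan == {"brake": (0, 0), "door": (1, 0), "diag": (1, 1)}
```

An ILP formulation replaces this first-fit search with binary placement variables and optimizes over all placements at once, which is what makes the single-stage approach optimal but unscalable.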

  • Z. Du, S. Liu, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, Q. Guo, X. Feng, Y. Chen, and O. Temam

An Accelerator for High Efficient Vision Processing
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7497562

In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and performance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The state-of-the-art neural networks for these applications are convolutional neural networks (CNNs), and they have an important property: weights are shared among many neurons, considerably reducing the neural network memory footprint. This property makes it possible to map a CNN entirely within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs. In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses, combined with a careful exploitation of the specific data access patterns within CNNs, allows us to design a highly energy-efficient accelerator. We present a single-core implementation down to the layout at 65 nm, with a modest footprint of 5.94 mm² and consuming only 336 mW, yet still about 30× faster than high-end GPUs. For visual processing with higher resolution and frame-rate requirements, we further present a multicore implementation with elevated performance.
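
The weight-sharing argument is easy to check with back-of-the-envelope arithmetic (the layer shapes below are illustrative, not the accelerator's actual workload): kernels are reused at every output position, so the total weight footprint stays small enough for on-chip SRAM.

```python
def conv_weights(in_ch, out_ch, k):
    """Weight count of one convolutional layer with k x k kernels."""
    return in_ch * out_ch * k * k

layers = [conv_weights(3, 32, 5), conv_weights(32, 64, 5), conv_weights(64, 64, 3)]
total_bytes = sum(layers) * 2             # assuming 16-bit fixed-point weights
assert total_bytes < 512 * 1024           # fits comfortably in on-chip SRAM
```

A fully connected layer over the same feature maps would need orders of magnitude more weights, which is why weight sharing is the property that eliminates DRAM accesses here.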

Emerging Technologies and Applications

  • R. Wang, D. Jia, T. Li, and D. Qian

Achieving Versatile and Simultaneous Cache Optimizations With Nonvolatile SRAM
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7496955

The efficiency of caches plays a vital role in microprocessors. In this paper, we introduce a novel and flexible cache substrate that integrates nonvolatile memory devices into standard SRAM cells. By allowing this nonvolatile SRAM (NV-SRAM) cell to store inconsistent data between its SRAM and NV portions, we show that the proposed NV2-SRAM cache not only provides enriched functionalities but also allows multiple simultaneous optimizations. For example, the NV2-SRAM cache can reduce cache misses caused by context switching and improve performance by 15%. It can also save up to 67% energy over an SRAM-based cache, outperforming the drowsy cache in terms of both power efficiency and reliability. Moreover, the proposed cache architecture can improve prefetching performance by 10%. Compared with a conventional cache (equipped with a victim buffer) that occupies the same die area, the NV2-SRAM cache gains an 11% performance benefit. To achieve simultaneous optimizations, we propose architecture and OS support to optimize cache power, performance, and reliability concurrently on multicore-based systems.

  • X. Chen, D. Zhang, L. Wang, N. Jia, Z. Kang, Y. Zhang, and S. Hu

Design Automation for Interwell Connectivity Estimation in Petroleum Cyber-Physical Systems
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7497519

In a petroleum cyber-physical system (CPS), interwell connectivity estimation is critical for improving petroleum production. An accurately estimated connectivity topology facilitates reducing production cost and improving waterflood management. This paper presents the first study focused on computer-aided design for a petroleum CPS. A new CPS framework is developed to estimate petroleum well connectivities. The framework explores an innovative water/oil index integrated with advanced cross-entropy optimization. It is applied to a real industrial petroleum field with massive petroleum CPS data. The experimental results demonstrate that our automated estimates match well with the expensive tracer-based ground-truth observations, showing that our framework is highly promising.
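
For readers unfamiliar with cross-entropy optimization, here is a generic minimal sketch of the method (the paper couples it with a water/oil index that is not reproduced here; the toy objective and all names are illustrative): sample candidates from a parametric distribution, keep the elite fraction, and refit the distribution to them.

```python
import random
import statistics

def cross_entropy_minimize(loss, dim, iters=50, pop=100, elite=10):
    """Fit a Gaussian sampling distribution toward low-loss candidates."""
    mu = [0.5] * dim
    sigma = [0.5] * dim
    for _ in range(iters):
        samples = [[random.gauss(mu[i], sigma[i]) for i in range(dim)]
                   for _ in range(pop)]
        samples.sort(key=loss)            # elite fraction = best candidates
        best = samples[:elite]
        for i in range(dim):
            col = [s[i] for s in best]
            mu[i] = statistics.fmean(col)
            sigma[i] = statistics.stdev(col) + 1e-6
    return mu

# Toy stand-in objective: recover two connectivity weights.
random.seed(0)
target = [0.2, 0.8]
loss = lambda w: sum((a - b) ** 2 for a, b in zip(w, target))
estimate = cross_entropy_minimize(loss, dim=2)
assert loss(estimate) < 0.01
```

The appeal of the method for connectivity estimation is that it needs only loss evaluations, not gradients of the physical model.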

Logic Synthesis

  • A. Saifhashemi, H.-H. Huang, and P. A. Beerel

Reconditioning: A Framework for Automatic Power Optimization of QDI Circuits
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7476844

This paper introduces reconditioning: a novel systematic technique for reducing unnecessary power consumption of asynchronous gate-level netlists, which involves the optimal reordering of conditional communication and logic primitives. Our technique is applicable to asynchronous circuits with handshaking protocols that encode data and control together, in particular, quasi-delay-insensitive and 1-of-N handshaking circuits. Both an optimal integer linear program (ILP) and a fast heuristic algorithm are presented. We show that our ILP is feasible for moderate-size circuits and that our heuristic algorithm scales to much larger circuits, completing in seconds on circuits with tens of thousands of gates. Our experimental results show that the power improvement depends strongly on the structure of the circuit but can often be above 26%, with typically less than 5% area overhead.

System-Level Design

  • C. Chen, Y. Fu, and S. Cotofana

Towards Maximum Utilization of Remained Bandwidth in Defected NoC Links
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7473926

To maximize the utilization of the available network-on-chip (NoC) link bandwidth, partially faulty links with a low fault level should be utilized, while heavily defected (HD) links should be deactivated and dealt with by means of a fault-tolerant routing algorithm. To reach this target, we make the following contributions in this paper: 1) we propose a flit serialization (FS) method to efficiently utilize partially faulty links. The FS approach divides links into a number of equal-width sections and serializes sections of adjacent flits to transmit them on all fault-free link sections, mitigating the mismatch between the flit size and the actual link bandwidth; 2) we propose link augmentation with one redundant section as a low-cost mechanism to mitigate the FS drawback that a link's available bandwidth is reduced even if it contains only one faulty wire; and 3) we deactivate HD links when their fault level exceeds a certain threshold to diminish congestion caused by HD links. The optimal threshold is derived by comparing the zero-load packet transmission latency on the HD links with that on the shortest alternative path. Our proposal is evaluated with synthetic traffic and PARSEC benchmarks. Experimental results indicate that the FS method achieves a lower area*power/saturation_throughput value than state-of-the-art link fault-tolerance strategies. With a redundant section in each link, the NoC saturation throughput can be improved substantially over utilizing FS alone, e.g., by 18% when 10% of the NoC wires are broken. Simulation results obtained at various broken-wire rates indicate that we achieve the highest saturation throughput if 4- or 8-section links with a flit transmission latency longer than four cycles are deactivated.
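
The bandwidth arithmetic behind contributions 1 and 2 can be sketched in a few lines (a simplified model, not the paper's implementation; parameter names are illustrative): a flit that normally uses all link sections is serialized over the fault-free ones.

```python
from math import ceil

def cycles_per_flit(total_sections: int, faulty_sections: int,
                    redundant: int = 0) -> int:
    """Cycles to push one flit through a partially faulty, sectioned link."""
    healthy = total_sections + redundant - faulty_sections
    if healthy <= 0:
        raise ValueError("no healthy sections: deactivate and route around")
    return ceil(total_sections / healthy)

assert cycles_per_flit(4, 0) == 1                # fault-free: full bandwidth
assert cycles_per_flit(4, 1) == 2                # one dead section halves it
assert cycles_per_flit(4, 1, redundant=1) == 1   # spare section restores it
```

The third assertion shows why a single redundant section is worthwhile: without it, one faulty wire already doubles the per-flit latency of a 4-section link.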

  • Z. Zhao, A. Gerstlauer, and L. K. John

Source-Level Performance, Energy, Reliability, Power and Thermal (PERPT) Simulation
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7488221

With ever-increasing design complexities, traditional cycle-accurate or instruction-set simulations are often too slow or too inaccurate for system prototyping in early design stages. As an alternative, host-compiled or source-level software simulation has been proposed, but existing approaches have largely focused on timing simulation only. In this paper, we propose a novel source-level simulation infrastructure that provides a full range of performance, energy, reliability, power and thermal (PERPT) estimation. Using a fully automated, retargetable back-annotation framework, intermediate-representation code is statically annotated with timing, energy, and resource-access information obtained from low-level references at basic-block granularity. The annotated model is natively compiled and combined with a cache model and occupancy analyzer to provide target performance, energy, soft-error vulnerability, and power estimates. Finally, the generated power traces are fed into thermal models for temperature estimation. Comprehensive evaluations of our source-level models for PERPT estimation are performed. We applied our approach to PowerPC targets running various industry benchmark suites. Source-level simulations are evaluated for different PERPT metrics and with cache models at various levels of detail to explore speed and accuracy tradeoffs. More than 90% accuracy can be achieved for timing, energy, reliability, and power estimation, and steady-state thermal estimation shows an average error of 0.05 K. Simulation speeds range from 180 to 5740 MIPS for different types of metrics at different abstraction levels.
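
The back-annotation principle can be shown with a toy sketch (the block names and cycle costs are hypothetical, not from the paper): each basic block carries a statically characterized cost, and the natively executing model simply accumulates those costs instead of interpreting target instructions, which is where the large speedup comes from.

```python
# Per-block costs from a low-level (e.g., ISS or gate-level) reference run.
BLOCK_CYCLES = {"loop_body": 12, "loop_exit": 3}

class Perf:
    """Accumulates annotated basic-block costs during native execution."""
    def __init__(self):
        self.cycles = 0
    def annotate(self, block):
        self.cycles += BLOCK_CYCLES[block]

perf = Perf()
for _ in range(100):          # the host-compiled loop itself runs natively
    perf.annotate("loop_body")
perf.annotate("loop_exit")
assert perf.cycles == 1203
```

The paper's framework extends the same accumulation to energy and resource accesses and layers cache, vulnerability, power, and thermal models on top.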

  • X. Lou, Y. J. Yu, and P. K. Meher

Lower Bound Analysis and Perturbation of Critical Path for Area-Time Efficient Multiple Constant Multiplications
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7498633

In this paper, a precise systematic delay model is proposed for the analysis and estimation of the critical path delay of multiple constant multiplication (MCM) blocks. For the first time in the literature, a mathematical derivation of the lower bound on the critical path delay of MCM blocks is presented, and necessary conditions for achieving this lower bound are discussed. It is shown that the lower bound is significantly smaller than the critical path delay achieved by existing MCM algorithms. An improved genetic algorithm-based approach, with a heuristic algorithm to generate the initial population, is proposed to search for low-complexity MCM solutions that attain the lower bound on critical path delay. This is the first time that design algorithms with gate-level delay control are proposed. Moreover, it is shown that, using the information of the lower bound on critical path delay, perturbation of timing can be applied to trade off the lower-bound critical path delay against hardware complexity. It is shown that area-time-efficient designs of MCM blocks can be obtained using the proposed techniques.
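
A classic, coarser bound gives the flavor of such lower-bound reasoning (this is the well-known adder-depth bound, not the paper's gate-level delay model): a constant whose canonical signed-digit (CSD) form has n nonzero digits needs at least ceil(log2(n)) adder stages, and the MCM block's critical path is bounded below by the largest such depth over its constants.

```python
from math import ceil, log2

def csd_nonzeros(c: int) -> int:
    """Nonzero digit count of the canonical signed-digit (CSD) form of c."""
    count = 0
    while c:
        if c & 1:
            count += 1
            # pick digit +1 or -1 so the remainder becomes divisible by 4
            c = c - 1 if (c & 3) == 1 else c + 1
        c >>= 1
    return count

def depth_lower_bound(constants):
    """Combining n nonzero digits needs at least ceil(log2(n)) adder stages."""
    return max(ceil(log2(csd_nonzeros(c))) for c in constants)

assert csd_nonzeros(7) == 2           # 7 = 8 - 1
assert csd_nonzeros(45) == 4          # 45 = 64 - 16 - 4 + 1
assert depth_lower_bound([7, 45]) == 2
```

The paper's contribution is a finer-grained version of this idea: bounding delay at the gate level rather than in whole adder stages, and then searching for solutions that meet the bound.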

TEST

  • A. Mysore Somashekar and S. Tragoudas

Diagnosis of Performance Limiting Segments in Integrated Circuits Using Path Delay Measurements
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7476831

An approach capable of identifying the locations of distributed small-delay defects, arising due to manufacturing aberrations, is proposed. It is shown that the proposed formulation can be transformed into a Boolean satisfiability form to be solved by any satisfiability solver. The approach provides a small number of alternative sets of potential defective segments, one of which is the actual defect configuration. This is shown to be a very important property for the effective identification of the defective segments. Experimental analysis on the International Symposium on Circuits and Systems (ISCAS) and International Test Conference (ITC) benchmark suites shows that the proposed approach is highly scalable and identifies the locations of multiple delay defects.

  • J. Park, H. Lim, and S. Kang

FRESH: A New Test Result Extraction Scheme for Fast TSV Tests
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7488276

Three-dimensional integrated circuits (3-D ICs) are expected to meet the performance needs of future ICs. The core components of 3-D ICs are through-silicon vias (TSVs), which must pass appropriate prebond and post-bond tests in 3-D IC fabrication processes. The test inputs must be injected into the TSVs, and the test results must be extracted. This paper proposes a new test result extraction scheme, fast result extraction by selective shift-out (FRESH), for prebond and post-bond TSV testing. With additional hardware, the proposed scheme markedly reduces the TSV test time. FRESH avoids unnecessary test result extraction when the number of faulty TSVs in a TSV set is zero or exceeds the number of TSV redundancies in the set. These early fault analyses are executed in the checkers of the TSV groups. The experimental results show that the proposed scheme can reduce the result extraction time in practical environments.
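
FRESH's early-out rule, as described above, reduces to a simple predicate (a sketch of the decision only, not the checker hardware; names are illustrative): detailed shift-out is needed only for groups that are faulty yet still repairable.

```python
def needs_extraction(num_faulty: int, num_redundant: int) -> bool:
    """Shift out detailed results only for partially faulty, repairable groups."""
    return 0 < num_faulty <= num_redundant

assert not needs_extraction(0, 2)    # fault-free group: nothing to locate
assert needs_extraction(1, 2)        # repairable: extract to find which TSV
assert not needs_extraction(3, 2)    # more faults than spares: group unusable
```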

SHORT PAPERS

  • L. Zhu, Y. Badr, S. Wang, S. Iyer, and P. Gupta

Assessing Benefits of a Buried Interconnect Layer in Digital Designs
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7478063

In sub-15 nm technology nodes, local metal layers have witnessed extremely high congestion, leading to pin-access-limited designs and hence affecting chip area and performance. In this paper, we assess the benefits of adding a buried interconnect layer below the device layers for the purpose of reducing cell area, improving pin access, and reducing chip area. After adding the buried layer to a projected 7 nm standard cell library, results show a ~9%-13% chip area reduction and a 126% pin-access improvement. This shows that buried interconnect, as an integration primitive, is very promising as an alternative path to density scaling.

  • I. Pomeranz

Sequential Test Generation Based on Preferred Primary Input Cubes
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7469862

It was shown earlier that certain primary input values have a negative effect on the fault coverage of a functional test sequence when they appear repeatedly in the sequence. A gate-level sequential test generation procedure based on this observation computed a primary input cube c with preferred values, and generated random functional test sequences that conformed to c with a high probability 0.5 ≤ p < 1. This procedure selected p out of a set of possible values and assigned the same value of p to all the primary inputs. Motivated by the low computational complexity of this procedure, this paper addresses the selection of p and the possibility of using different values of p for different primary inputs. The goal is to increase the fault coverage and reduce the number of functional test sequences. The procedure described in this paper adjusts a functional test sequence to a circuit by complementing values that conflict with c. The procedure requires fewer functional test sequences to reach or exceed the fault coverage of the earlier procedure on benchmark circuits. It can be applied to any functional test sequence or set of functional test sequences.
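
The per-input generation step described above can be sketched as follows (a minimal illustration, not the paper's procedure; the cube, probabilities, and names are hypothetical): input i takes its preferred value c[i] with probability p[i], a random value when the cube leaves it unspecified ('x'), and the complement otherwise.

```python
import random

def random_vector(cube: str, p: list) -> str:
    """One random test vector conforming to cube bit i with probability p[i]."""
    bits = []
    for pref, prob in zip(cube, p):
        if pref == "x":
            bits.append(random.choice("01"))     # no preferred value
        elif random.random() < prob:
            bits.append(pref)                    # conform to the cube
        else:
            bits.append("1" if pref == "0" else "0")
    return "".join(bits)

random.seed(1)
cube = "1x0"
seq = [random_vector(cube, [0.9, 0.5, 0.7]) for _ in range(1000)]
conform = sum(v[0] == "1" for v in seq) / len(seq)
assert 0.85 < conform < 0.95          # first input conforms ~90% of the time
```

Allowing a different p[i] per input, as this paper proposes, amounts to replacing the single scalar with the per-input list shown here.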