1. Balancing and Trading Off Area and Speed
Area refers to the amount of logic resources a design consumes in an FPGA/CPLD. For an FPGA it can be measured by the number of flip-flops (FFs) and look-up tables (LUTs) consumed; a more general measure is the equivalent gate count of the design.
Speed refers to the highest frequency at which the design runs stably on the chip. This frequency is determined by the timing of the design and is closely related to timing quantities such as the clock period the design must meet, pad-to-pad delay, clock setup time, clock hold time and clock-to-output delay.
Area and speed run through the whole of FPGA/CPLD design and are the ultimate criteria for judging design quality. The two form a unity of opposites: they pull against each other, yet must be considered together.
It is unrealistic to demand both the smallest area and the highest operating frequency at the same time. A more sensible goal is to occupy the smallest chip area while meeting the design's timing requirements (including its target frequency), or, within a specified area, to achieve the largest timing margin and the highest frequency. Both goals embody the idea of balancing area and speed.
The two sides of this trade-off do not carry equal weight. Meeting the timing and operating-frequency requirements is the more important of the two, so when they conflict, speed takes priority.
In theory, if a design has a large timing margin and can run much faster than required, the chip area it consumes can be reduced by reusing (time-multiplexing) functional modules, trading surplus speed for savings in area. Conversely, if the timing requirements are very demanding and a straightforward implementation cannot reach the target frequency, the data stream can generally be converted from serial to parallel and processed by several replicated operation modules working in parallel, applying the ideas of ping-pong operation and serial-to-parallel conversion to the whole design. This trades area for speed.
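As a rough illustration of this trade-off, the Python sketch below estimates how many jobs one module could time-share when there is speed to spare, and how many parallel copies would be needed when a single module is too slow. The function names and the example rates are hypothetical, and the estimate ignores multiplexing and control overhead; real figures come from timing analysis of the actual design.

```python
def reuse_factor(achievable_mhz: int, required_mhz: int) -> int:
    """Jobs one module can time-share (surplus speed traded for area)."""
    if achievable_mhz <= 0 or required_mhz <= 0:
        raise ValueError("frequencies must be positive")
    # A module that runs N times faster than required can serve N jobs in turn.
    return max(1, achievable_mhz // required_mhz)


def parallel_copies(stream_mbps: int, module_mbps: int) -> int:
    """Copies of a module needed to keep up with a stream (area traded for speed)."""
    if stream_mbps <= 0 or module_mbps <= 0:
        raise ValueError("rates must be positive")
    # Ceiling division on integers: a partly used copy still costs a whole copy.
    return -(-stream_mbps // module_mbps)


# Hypothetical numbers, for illustration only.
print(reuse_factor(200, 50))     # one module shared by four jobs
print(parallel_copies(100, 40))  # three copies needed for the stream
```

The first function corresponds to module reuse, the second to replicating modules after a serial-to-parallel split.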
2. Hardware Principles
The hardware principle concerns mainly how HDL code is written. Verilog uses a C-like syntax, but it is an abstraction of hardware: its essential purpose is to describe hardware, and what it finally produces is an actual circuit inside the chip. The final criterion for judging a piece of HDL code is therefore the performance of the hardware circuit it describes and implements, in both area and speed.
Judging the quality of a design's code really means judging how fluently and sensibly the design has been translated from hardware into HDL. The final performance of a design depends far more on the efficiency and soundness of the hardware implementation scheme the engineer has conceived; HDL code is just one form of expression for a hardware design.
Beginners who single-mindedly pursue clean, short code are mistaken; this runs against the purpose of an HDL. The correct way to code is first to have a firm grasp of the hardware circuit to be built, with a clear picture of its structure and connections, and then to express it with appropriate HDL statements.
In addition, Verilog is hierarchical as an HDL, with descriptions possible at the system level, algorithm level, register transfer level, logic level, gate level and switch level. At the coding level, an if...else chain builds a priority tree that can consume a lot of combinational logic, so where a case statement can be used, prefer case over if...else.
3. System Principles
The system principle has two levels of meaning. At the higher level, it concerns the hardware system as a whole: how modules are partitioned and tasks allocated on a board, which algorithms and functions are suited to implementation in an FPGA and which in a DSP or CPU, how large an FPGA is needed, and how the data interfaces are designed. Within the FPGA design itself, it means planning the overall design sensibly at the macro level, covering clock domains, module reuse, constraints, area, speed and similar issues. Optimization at the system level matters most.
Generally speaking, functional modules with strict real-time requirements and high operating frequencies are suited to FPGA implementation. Compared with a CPLD, an FPGA is better suited to designs that are larger, run at higher frequencies and use more registers. When designing with an FPGA/CPLD, you should understand in depth the underlying hardware resources inside the chip and the design resources available.
For example, FPGAs generally have abundant flip-flop resources, whereas CPLDs are richer in combinational logic. An FPGA/CPLD is generally composed of basic programmable logic units, block RAM (BRAM), routing resources, configurable I/O units, clock resources and so on.
The basic programmable logic unit generally consists of flip-flops and look-up tables. Xilinx's basic programmable resource is called a SLICE, which in the classic architectures consists of two FFs and two LUTs; Altera's is called an LE, which consists of one FF and one LUT. The exact composition varies between device families. Common modules such as single-port RAM, dual-port RAM, synchronous and asynchronous FIFOs, ROM and CAM can be implemented with on-chip RAM.
Simplified process for general FPGA system planning
4. Synchronous Design Principles
The core logic of an asynchronous circuit is implemented with combinational logic, as in the read and write signals of an asynchronous FIFO/RAM, address decoding and similar circuits. Its main signals and outputs do not depend on any clock and are not produced by clocked flip-flops. The biggest drawback of asynchronous circuits is that they are prone to glitches, which show up clearly in post-place-and-route simulation and when real signals are observed with a logic analyzer.
The core logic of a synchronous sequential circuit is implemented with flip-flops, and its main signals and outputs are produced by flip-flops driven by a clock edge. Synchronous sequential circuits avoid glitches very well: glitches do not appear on the registered signals in post-place-and-route simulation or when the working signals are sampled with a logic analyzer.
Do synchronous sequential circuits necessarily use more resources than asynchronous circuits? In a simple ASIC design, a D flip-flop takes about 7 gates, whereas a 2-input NAND takes one gate, so in general a synchronous sequential circuit occupies more area than an asynchronous one. (This is different in an FPGA/CPLD, mainly because area there is counted in logic cells rather than gates.)
How is delay implemented in a synchronous sequential circuit? In an asynchronous circuit, delay is usually created by inserting a buffer, two cascaded NAND gates and so on; this method of adjusting delay is not suitable for synchronous design. It must first be made clear that the delay-control syntax in an HDL is a behavioral description, commonly used for simulation test stimulus; it is ignored during synthesis and produces no delay in the circuit.
Delay in a synchronous sequential circuit is generally achieved through timing control; in other words, the delay is designed as part of the circuit logic. For relatively long delays and special timing requirements, a counter driven by a high-speed clock is generally used and the delay is controlled by the count value. For relatively short delays, the signal can be passed through a D flip-flop: this delays it by one clock cycle and also completes an initial synchronization of the signal to the clock, a technique commonly used for sampling input signals and for increasing timing margin.
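The two delay methods described here can be illustrated with a small cycle-by-cycle behavioral model in Python. This is only a sketch of the idea, not synthesizable code: each list element stands for one clock cycle, and the names `delay_line` and `counter_delay` as well as the reset value of 0 are assumptions made for the example.

```python
from typing import List


def delay_line(samples: List[int], stages: int, reset_value: int = 0) -> List[int]:
    """Short delay: pass the signal through `stages` D flip-flops in series."""
    if stages < 1:
        raise ValueError("need at least one flip-flop")
    regs = [reset_value] * stages
    out = []
    for s in samples:
        out.append(regs[-1])       # value visible at the output during this cycle
        regs = [s] + regs[:-1]     # clock edge: every flip-flop captures its input
    return out


def counter_delay(trigger: List[int], cycles: int) -> List[int]:
    """Long delay: pulse `done` exactly `cycles` clocks after a trigger pulse."""
    if cycles < 1:
        raise ValueError("delay must be at least one cycle")
    remaining = 0
    done = []
    for t in trigger:
        done.append(1 if remaining == 1 else 0)  # terminal count reached
        if remaining > 0:
            remaining -= 1         # counter runs on every clock edge
        if t:
            remaining = cycles     # a trigger (re)loads the counter
    return done


print(delay_line([1, 2, 3, 4], stages=2))      # input appears two cycles later
print(counter_delay([1, 0, 0, 0, 0], cycles=3))  # pulse three cycles after trigger
```

In both models the delay is a whole number of clock cycles, which is the point of the text: the delay comes from clocked logic rather than from gate propagation.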
How is the clock of a synchronous sequential circuit generated? The quality and stability of the clock directly determine the performance of a synchronous sequential circuit. A synchronous sequential circuit also requires its input signals to be synchronized. If the input data arrives at the same frequency as the processing clock of this chip and setup and hold times are met, the chip's main clock can be used to sample the input data directly into a register, which completes the synchronization. If the input data is asynchronous to the processing clock of this chip, especially when the frequencies do not match, the input data must be sampled twice in succession by the processing clock (two register stages) to complete the synchronization.
If a signal is declared as reg, is it necessarily synthesized into a register, making the circuit synchronous and sequential? The answer is no. The two most commonly used data types in Verilog are wire and reg. Generally speaking, wire is used for data and nets driven by combinational logic, while data declared as reg is not necessarily implemented with registers: a reg assigned in a combinational always block, for instance, synthesizes to combinational logic.
5. Ping-pong operation
“Ping-pong operation” is a processing technique often used in data flow control. Its processing flow is as follows: the input data stream is distributed in equal time periods to two data buffers through the “input data selection unit”. The data buffer module can be any storage module; commonly used ones are dual-port RAM (DPRAM), single-port RAM (SPRAM), FIFO and so on.
In the first buffering cycle, the input data stream is buffered into “data buffer module 1”. In the second buffering cycle, the “input data selection unit” switches and the input data stream is buffered into “data buffer module 2”; at the same time, the first-cycle data held in “data buffer module 1” is selected by the “output data selection unit” and sent to the “data stream operation processing module” for processing. In the third buffering cycle, the “input data selection unit” switches again and the input data stream is buffered into “data buffer module 1”; at the same time, the second-cycle data held in “data buffer module 2” is selected by the “output data selection unit” and sent to the “data stream operation processing module” for processing. The cycle then repeats.
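The buffering cycles just described can be sketched as a behavioral model in Python. It is a simplified illustration under stated assumptions, not a hardware description: one loop iteration stands for one buffering cycle, the function name `ping_pong` and its parameters are hypothetical, and `process` stands in for the data stream operation processing module.

```python
from typing import Callable, List, Optional, Tuple


def ping_pong(
    stream: List[int],
    block_len: int,
    process: Callable[[List[int]], List[int]],
) -> Tuple[List[int], List[Tuple[int, Optional[int]]]]:
    """Alternate two buffers: one is written while the other is processed."""
    buffers: List[List[int]] = [[], []]
    write_sel = 0                      # state of the input data selection unit
    out: List[int] = []
    schedule = []                      # (buffer written, buffer read) per cycle
    for i in range(0, len(stream), block_len):
        # Input selection: this cycle's data goes into the selected buffer.
        buffers[write_sel] = stream[i:i + block_len]
        # Output selection: the other buffer holds the previous cycle's data.
        read_sel = 1 - write_sel
        if buffers[read_sel]:
            out.extend(process(buffers[read_sel]))
            schedule.append((write_sel, read_sel))
        else:
            schedule.append((write_sel, None))   # first cycle: nothing to read yet
        write_sel = read_sel           # both selection units switch together
    # After the input ends, drain the buffer that was written last.
    out.extend(process(buffers[1 - write_sel]))
    return out, schedule


result, plan = ping_pong([1, 2, 3, 4, 5, 6], 2, lambda blk: [2 * x for x in blk])
print(result)
print(plan)
```

The recorded schedule shows the alternation: each buffer is written in one cycle and read in the next, so the processing module always has a full block available.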
Typical ping-pong operation method
The biggest feature of the ping-pong operation is that, by switching the “input data selection unit” and the “output data selection unit” in step, the buffered data stream is delivered to the processing module without interruption. Viewed from either end with the ping-pong module treated as a whole, both the input and the output data streams are continuous, with no pauses, so the technique suits pipelined processing of data streams very well. Ping-pong operations are therefore often used in pipelined algorithms to achieve seamless buffering and processing of data.
The second advantage of ping-pong operation is that it saves buffer space. For example, in WCDMA baseband applications a frame consists of 15 time slots, and sometimes a whole frame of data has to be delayed by one time slot before processing. The straightforward approach is to buffer the whole frame and then process it one time slot later, in which case the buffer must hold one frame of data. Assuming a data rate of 3.84 Mb/s and a frame length of 10 ms, the buffer must be 38400 bits long. With ping-pong operation, only two RAMs each holding one time slot of data are needed: while data is written to one RAM, data is read from the other and sent to the processing unit. Each RAM then needs only 2560 bits, for a total of 5120 bits across the two.
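The buffer sizes quoted in this example follow directly from the data rate, frame length and slot count given in the text. The short Python calculation below reproduces them; the variable names are chosen for the example only.

```python
RATE_BPS = 3_840_000      # data rate from the text: 3.84 Mb/s
FRAME_MS = 10             # frame length from the text: 10 ms
SLOTS_PER_FRAME = 15      # a frame consists of 15 time slots

# Straightforward approach: buffer one whole frame.
frame_bits = RATE_BPS * FRAME_MS // 1000

# Ping-pong approach: two RAMs, each holding one time slot.
slot_bits = frame_bits // SLOTS_PER_FRAME
ping_pong_bits = 2 * slot_bits

print(frame_bits, slot_bits, ping_pong_bits)
```

Under these figures the two-RAM arrangement needs 5120 bits against 38400 bits for the full-frame buffer, a reduction by a factor of 7.5.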
Ping-pong operation with low-speed modules to handle high-speed data streams
In addition, clever use of ping-pong operation makes it possible to process a high-speed data stream with low-speed modules. As shown in Figure 2, the data buffer module is a dual-port RAM, and a stage of data preprocessing is introduced after the DPRAM. This preprocessing can be whatever data operation is required; in a WCDMA design, for example, it might be despreading, descrambling or de-rotating the input data stream. Assume that the input data stream at port A runs at 100 Mbps and that the buffer period of the ping-pong operation is 10 ms.
6. Serial-to-parallel conversion design techniques
Serial-to-parallel conversion is an important technique in FPGA design. It is a common method of data stream processing and a direct embodiment of the idea of trading area against speed. It can be realized in many ways; depending on the ordering and amount of data, registers, RAM and so on can be used.
In the ping-pong example above, serial-to-parallel conversion of the data stream is performed with DPRAM, and because DPRAM is used the data buffer can be made very large; for designs with a relatively small amount of data, the conversion can be done with registers. Unless there are special requirements, the conversion between serial and parallel should be done with synchronous sequential logic. For example, a serial-to-parallel conversion with the high-order bit arriving first can be coded as a shift register: on each clock edge, prl_temp is shifted up by one bit and srl_in is placed in its lowest bit.
Here prl_temp is the parallel output buffer register and srl_in is the serial data input. A conversion with a specified bit ordering can be implemented with a case statement, and a complex conversion can be implemented with a state machine. The method is relatively simple and is not described further here.
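A minimal behavioral sketch of such a shift-register conversion, written in Python rather than HDL, is given below. It reuses the names prl_temp and srl_in from the text; the function name, the `width` parameter and the choice to emit a word every `width` bits are assumptions made for the example.

```python
from typing import Iterable, List


def serial_to_parallel(bits: Iterable[int], width: int) -> List[int]:
    """Shift serial bits, high-order bit first, into `width`-bit words."""
    if width < 1:
        raise ValueError("width must be at least 1")
    mask = (1 << width) - 1
    prl_temp = 0                       # parallel output buffer register
    words = []
    for n, srl_in in enumerate(bits, start=1):
        # Clock edge: shift up by one bit and insert the new bit at the bottom,
        # so the first bit received ends up in the highest position.
        prl_temp = ((prl_temp << 1) | (srl_in & 1)) & mask
        if n % width == 0:
            words.append(prl_temp)     # a full word is available
    return words


print(serial_to_parallel([1, 0, 1, 1, 0, 0, 1, 0], width=4))
```

With a 4-bit width, the bit sequence 1, 0, 1, 1 becomes the word 0b1011, confirming that the first bit received is the most significant.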
7. Design idea of pipeline operation
It should be stated first that the pipeline described here is a design idea concerning processing flow and sequential operation, not the “pipelining” technique used to optimize timing in FPGA and ASIC design.
Pipelined processing is a common design method in high-speed designs. If the processing flow of a design can be divided into several steps and the whole data path is “single flow”, that is, there is no feedback or iteration and the output of each step is the input of the next, then the pipeline design method can be considered as a way to raise the operating frequency of the system.
The structure of the pipeline design
The schematic of a pipeline design is shown in the figure. Its basic structure is n suitably divided operation steps connected in series in a single flow direction. The defining feature and requirement of pipeline operation is that the data stream is processed continuously in time at every step. If each operation step is simplified to a single D flip-flop (that is, a register that delays the data by one beat), the pipeline behaves like a bank of shift registers: the data stream passes through the D flip-flops in turn, completing one step of the operation at each.
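The shift-register picture of a pipeline can be made concrete with a small cycle-by-cycle model in Python. This is an illustrative sketch only: the function name `run_pipeline` is hypothetical, each stage is modeled as a function followed by one register, `None` marks an empty register (so `None` cannot be used as a data value), and at least one stage is assumed.

```python
from typing import Callable, List, Optional


def run_pipeline(stages: List[Callable[[int], int]], inputs: List[int]) -> List[int]:
    """Model n operation steps in series, with one register after each step."""
    regs: List[Optional[int]] = [None] * len(stages)
    outputs = []
    # Feed the inputs, then extra empty cycles so the last data can drain out.
    feed: List[Optional[int]] = list(inputs) + [None] * len(stages)
    for x in feed:
        if regs[-1] is not None:
            outputs.append(regs[-1])   # result leaving the last register
        # Clock edge: every stage works on the value its predecessor held.
        new_regs = []
        prev = x
        for step, held in zip(stages, regs):
            new_regs.append(step(prev) if prev is not None else None)
            prev = held
        regs = new_regs
    return outputs


# Two hypothetical steps: add one, then double.
print(run_pipeline([lambda v: v + 1, lambda v: v * 2], [1, 2, 3]))
```

Once the pipeline is full, one result leaves per cycle even though each item spends one cycle in every step, which is why the stream is processed continuously in time.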
Pipeline Design Timing
One key to pipeline design is the sensible arrangement of the timing of the whole design, which requires a reasonable division into operation steps. If the operation time of the earlier stage exactly equals that of the later stage, the design is simplest: the output of the earlier stage feeds directly into the input of the later stage. If the operation time of the earlier stage is greater than that of the later stage, the output data of the earlier stage must be suitably buffered before being fed to the later stage. If the operation time of the earlier stage is less than that of the later stage, the data stream must be split across replicated logic, or the data must be stored at the earlier stage and processed afterwards; otherwise data will overflow at the later stage.
Pipelined processing is used frequently in WCDMA designs, for example in the RAKE receiver, the searcher and preamble acquisition. The reason pipelined processing achieves a higher frequency is that the processing modules are replicated, which is another concrete manifestation of the idea of trading area for speed.
8. Synchronization method of data interface
Synchronization of data interfaces is a common problem in FPGA/CPLD design, and it is both important and difficult. Many unstable designs can be traced to data interface synchronization problems. At the schematic design stage, some engineers manually insert BUFT or inverter gates to adjust the data delay, so that the clock of the current module meets the setup and hold time requirements for data from the upstream module.
To obtain stable sampling, some engineers generate many clock signals 90 degrees apart, sampling the data sometimes on the rising edge and sometimes on the falling edge in order to adjust the sampling position. Neither approach is advisable, because the sampling implementation must be redesigned whenever the chip is replaced or the design is migrated to another device family. Both approaches also leave the circuit with insufficient margin: once external conditions change (a rise in temperature, for example), the sampling timing may become completely disordered and the circuit may fail.
How can data be synchronized when the delays between input and output (inter-chip delay, PCB trace delay, delays of interface driver components and so on) cannot be measured or may vary? For unpredictable or variable data delays, a synchronization mechanism must be established, for example by using a synchronization enable or a synchronization indication signal. Data synchronization can also be achieved by passing the data through a RAM or FIFO.
Do constraints need to be added when designing data interface synchronization? It is recommended to add appropriate constraints; for high-speed designs in particular, be sure to constrain the period, setup time, hold time and so on. The added constraints serve two purposes: they push the implementation to reach a working frequency that satisfies the interface synchronization requirements, and they make the timing analysis report correct.