
Sensors (Basel). 2020 Oct; 20(19): 5655.

A Novel Hardware–Software Co-Design and Implementation of the HOG Algorithm

Received 2020 Aug 31; Accepted 2020 Sep 30.

Abstract

The histogram of oriented gradients is a commonly used feature extraction algorithm in many applications. Hardware acceleration can boost the speed of this algorithm due to its large number of computations. We propose a hardware–software co-design of the histogram of oriented gradients and the subsequent support vector machine classifier, which can be used to process data from digital image sensors. Our main focus is to minimize the resource usage of the algorithm while maintaining its accuracy and speed. This design and implementation make four contributions. First, we allocate the computationally expensive steps of the algorithm, including gradient calculation, magnitude computation, bin assignment, normalization and classification, to hardware, and the less complex windowing step to software. Second, we introduce a logarithm-based bin assignment. Third, we employ parallel computation and a time-sharing protocol to create a histogram in order to achieve the processing of one pixel per clock cycle after the initialization (setup time) of the pipeline, and to produce valid results at each clock cycle afterwards. Finally, we use a simplified block normalization logic to reduce hardware resource usage while maintaining accuracy. Our design attains a frame rate of 115 frames per second on a Xilinx® Kintex® UltraScale FPGA while using fewer hardware resources, and only losing accuracy marginally, in comparison with other existing work.

Keywords: hardware–software co-design, histogram of oriented gradients, bin assignment, FPGA resource usage, accuracy loss, frame rate

1. Introduction

Feature extraction is one of the main stages in a wide variety of pattern recognition applications [1,2,3]. In particular, feature extraction and description has been used in numerous computer vision algorithms for many applications. Many feature description algorithms have been proposed in the past decade, such as SIFT (scale-invariant feature transform) [3] and SURF (speeded up robust features) [4], which have all shown outstanding results in a broad variety of applications. HOG (histogram of oriented gradients) [5] is one of the commonly used descriptors and has proven to be useful in many computer vision applications, including human detection, car detection, and general object recognition.

An object detection system is typically a combination of an input sensor, a feature extraction module and a classifier that makes decisions based on the extracted features. Since the principal application of HOG features is in human and object detection, the output of a camera sensor is given to the HOG descriptor, usually followed by a suitable classifier such as an SVM (support vector machine). The main drawback of the HOG algorithm is its computational complexity, which prevents it from meeting the timing requirements of some practical applications. Therefore, many researchers have tried to implement this algorithm on hardware platforms such as GPUs (graphics processing units) and FPGAs (field programmable gate arrays) to reap the benefits of parallel computation and thus better speed.

Ma et al. [6] compared CPU (central processing unit), GPU, and FPGA implementations of the HOG algorithm. They implement the HOG algorithm on an Intel® Xeon® E5520 CPU, an Nvidia® Tesla® K20 GPU, and a Xilinx® Virtex®-6 FPGA. Their FPGA implementation consumes 130× less energy than the CPU and 31× less energy than the GPU to process a single frame, while the speed is almost 68× better than the CPU and 5× better than the GPU.

Since FPGA implementations typically consume less power than GPUs and CPUs, there has been considerable interest in FPGA implementation in numerous applications [7,8,9]. In particular, many scholars have contributed to FPGA implementations of the HOG algorithm [10,11,12,13,14]. Hardware implementations are usually evaluated by the four main metrics of speed, accuracy, power consumption and resource utilization. Since there are trade-offs between these metrics, many researchers aim to optimize a single metric depending on the specific application. One way to optimize these metrics is to benefit from the advantageous features of both hardware and software in implementing the algorithm. In these methods, the algorithm is partitioned into different functional stages and the most computationally complex stages are implemented on the FPGA. The stages that are sequential in nature or that control the data flow can be allocated to the CPU.

In this work, we propose a hardware–software co-design of the HOG algorithm. Our implementation consists of a fully pipelined HOG-SVM IP-core which is controlled by a MicroBlaze processor. MicroBlaze is a soft microprocessor core designed for Xilinx® FPGAs. This design benefits from both the computational efficiency of the hardware and the simplicity of the control mechanisms in software. Our method preserves accuracy and speed while decreasing resource utilization. Figure 1 shows the KCU105 FPGA board and the test environment of this research project. A sample image from the INRIA dataset [15] and the block diagram of the whole proposed system in the Vivado® software which we used are shown on the display. We used the UART (universal asynchronous receiver/transmitter) port to load the image into the system as a matter of experimental convenience. We could input the image in any other fashion and the results would be the same.

Figure 1. The KCU105 FPGA board (left) connected to the computer station (right) for experiments.

We make four contributions in this paper. The first is an efficient task allocation between hardware and software. The second is a logarithm-based bin assignment in the HOG algorithm. The third is a hardware design for computing the histograms using two parallel modules. Finally, the fourth contribution is the approximation of the normalization level, which preserves the accuracy of the system while reducing the hardware resource consumption.

In the rest of this paper, we introduce the HOG algorithm briefly in Section 2. We then review existing works that focus on hardware–software co-design of the HOG algorithm in Section 3. Next, we introduce our hardware–software co-design method in Section 4. In Section 5, we provide the details of our implementation. In Section 6, we compare the results of our design with other work and discuss the advantages and disadvantages of our design. Finally, in Section 7, we provide conclusions and future work directions.

2. Review of the Algorithm

In this section, we briefly review the HOG algorithm and the SVM classifier in Section 2.1 and Section 2.2, respectively.

2.1. The HOG Algorithm

The HOG algorithm has several steps, as shown in Figure 2.

Figure 2. A flowchart of the HOG algorithm from the input image sensor to the HOG features.

In the first step, the derivatives in the horizontal and vertical directions are calculated for every pixel based on the neighbouring pixels in a 3 by 3 neighborhood, as shown in Equations (1) and (2):

G_x(x, y) = I(x + 1, y) − I(x − 1, y)

(1)

G_y(x, y) = I(x, y + 1) − I(x, y − 1)

(2)

where I(x, y) represents the image pixel located at the x and y coordinates, and G_x and G_y denote the gradients in the horizontal and vertical directions, respectively.

In the second step, the magnitude of the gradient is computed as shown in Equation (3). In addition, the orientation of each pixel is calculated by taking the arctangent of the gradient in the vertical direction G_y over the gradient in the horizontal direction G_x, as shown in Equation (4):

Magnitude(x, y) = sqrt( G_x(x, y)^2 + G_y(x, y)^2 )

(3)

Orientation(x, y) = tan^{−1}( G_y / G_x )

(4)

As shown in Figure 2, the next step adds the magnitude values to the bins according to the orientation of each pixel, for histogram generation.

In the fourth step, the histograms of the blocks are normalized separately. For block normalization, the L2-norm is usually used. For each block, which contains four histograms, the value of each bin is squared, and the normalized value of each bin is the value of that bin divided by the square root of the summation of these squares, as shown in Equation (5):

h_n = h_i / sqrt( ||h||_2^2 + ε^2 )

(5)

where h_n is the normalized bin value, h_i is the value of each bin, h is the vector of initial histogram values in the block, and ε is a very small number to prevent division by zero. The final HOG features are the concatenation of the normalized histograms.

HOG is computed for groups of pixels in the image. For example, every non-overlapping 16 pixels (4 × 4) form a cell and every four cells (2 × 2) form a block. Figure 3 represents this hierarchy. In Figure 3, the orientation of the gradient of each pixel is shown by an arrow. The boldness and size of the arrows represent the magnitude of the gradient of that pixel.

Figure 3. Visualization of cells (4 by 4 pixels) and blocks (each containing four cells) in the HOG algorithm.

2.2. Support Vector Machine

We chose an SVM as the classifier for two reasons. First, it is widely used with HOG features and has shown outstanding results, particularly for human detection applications [5]. Second, the inference step of this classifier typically consumes fewer hardware resources than other classifiers, such as those based on neural networks. Therefore, after computing the HOG features, we employ the SVM classifier for making decisions. An SVM is a linear classifier, which is used in many applications. In the training stage of the SVM classifier, the nearest samples to the decision boundary (the support vectors) are determined. Using optimization techniques, this classifier maximizes the margin of the support vectors from the decision boundary.

In the testing stage, no optimization is required. We can classify a sample using only the precomputed weights of the SVM from the training phase and the feature vector, as in Equation (6):

f(x) = Σ_i w_i x_i + b

(6)

where x represents the input features, and w_i and b are the weights and bias term learned by the classifier in the training stage, respectively. For classifying a sample, f(x) is compared to a threshold (commonly zero) and a decision is made based on this comparison. Due to the accuracy and simplicity of the testing stage of an SVM classifier, it is a popular choice for hardware implementation.
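A minimal sketch of this inference step, assuming precomputed weights w, a bias b and a zero threshold:

```python
import numpy as np

def svm_classify(x, w, b, threshold=0.0):
    """Linear SVM inference: f(x) = sum_i w_i * x_i + b, compared to a threshold."""
    score = float(np.dot(w, x) + b)
    return 1 if score > threshold else 0   # 1 = positive (e.g., person), 0 = negative
```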

3. Related Work on Hardware–Software Implementations

We have surveyed different methods for hardware implementation of the HOG algorithm, including an extensive review of methods with hardware–software co-design, in our previous work [16,17]. In this section, we first briefly review the recent work implementing the HOG algorithm fully in hardware. Then, we review the work using a hardware–software co-design methodology.

3.1. Hardware Implementation of the HOG Algorithm

There are several implementations of the HOG algorithm using pure hardware [11,12,13,14]. One of the benefits of implementing the HOG algorithm in hardware is of course speed enhancement. Implementing the whole algorithm in hardware is beneficial when resource consumption is not a constraint.

Qasaimeh et al. [12] propose a systolic architecture for hardware implementation of the HOG algorithm. They speed up the histogram generation by reusing the histogram bins generated for the adjacent cell. For each sliding window position, they subtract the contribution of the previous column of pixels from the histogram and add the contribution of the next column to generate the new histogram value. They speed up their design using this method and achieve 48 fps for 1920 × 1080 images.

Long et al. [13] propose an ultra-high-speed implementation of the HOG algorithm for object detection. They use a high-speed vision platform which contains a high-speed camera, the FASTCAM SA-X2. The vision platform sends 64 pixels per clock cycle to the HOG computation module as input. Instead of storing the HOG values in a memory, they store only the maximum values of the HOG feature vector and their corresponding coordinates so as to simplify the computations of the further steps.

Ngo et al. [14] propose a long pipeline architecture for the HOG algorithm with 155 stages. Although their proposed system contains a processor alongside the FPGA part for the HOG algorithm, since they use the processor only for adding bounding boxes onto the output image, we categorize this work as a hardware implementation of the HOG algorithm. In the HOG core, they use the CORDIC (coordinate rotation digital computer) algorithm for computing the magnitude and gradients. At the last stage, after computing the SVM score, they convert the fixed-point score value to floating-point and send it to the processor.

Luo et al. [11] propose a pure FPGA implementation of the HOG algorithm. They make several contributions which increase the frame rate of their design. For the bin assignment step, they use a comparison-based method instead of computing the arctangent. This method reduces hardware resource usage, but their design still requires four DSP (digital signal processing) cores for this part. They also propose an architecture for reusing the calculations in the block normalization step and for dividing the SVM calculation into partial stages to decrease the overall latency.

Pure hardware implementations of the HOG algorithm have the advantage of a higher speed of calculation. Naturally, they consume more hardware resources than the work which assigns parts of the tasks to a software processing system. There is a trade-off between the speed of the algorithm and resource utilization, which can be made based on the application and a cost evaluation of the processing systems. In Section 3.2, we review the work based on hardware–software co-design of the HOG algorithm.

3.2. Hardware–Software Implementation of the HOG Algorithm

In this work, we focus on the designs which propose a hardware–software co-design approach. The main advantage of these methods is that the resource usage of the hardware can be optimized while preserving the required speed for the application.

Mizuno et al. [18] propose a cell-based scanning scheme for implementing the HOG algorithm. They have parallelized modules for cell histogram generation, histogram normalization and SVM classification. Their proposed parallel architecture increases the speed while consuming more hardware resources. Their work is a hardware–software co-design, as they use a CPU to control the pipeline of the HOG algorithm. They simplify the HOG computation by such methods as using the CORDIC algorithm for gradient calculation, using the Newton method for histogram normalization, and using specific bit-widths for different modules. They store the intermediate data of the cell histograms in SRAM memory and load the data for the further steps.

Ma et al. [6] propose a hardware–software co-design approach for HOG-SVM computation. They profile the code on a CPU to find the most critical and computationally extensive parts of the algorithm. As a result of their analysis, they implement histogram generation and block normalization on an FPGA. They store the result of block normalization in memory, and for the classification step, they re-load the normalized values from the memory. To minimize memory operations, they store the magnitude and orientation values of each single pixel as a 32-bit value in a single memory location. They propose a multi-scale design which computes HOG for 34 scales. They resize the image and compute the magnitude and gradient in software, and then store the result of this step in the memory on the FPGA. In their design, the histogram generation and block normalization steps are assigned to the FPGA, and the results are written back to the memory. After that, the classification module loads the normalized histogram values from the memory and produces the final decision. They also apply different bit-widths to different modules, similar to Mizuno et al. [18], so as to have a more efficient implementation.

Rettkowski et al. [19] propose a hardware implementation, a software implementation, and a hardware–software co-design of the HOG algorithm. They implement their design on a Xilinx Zynq® platform, and for their hardware–software model, they use a Linux operating system on the board. They also compute the histogram generation and block normalization steps on the FPGA. They use SDSoC software, which is an IDE (integrated development environment) by Xilinx for implementing heterogeneous embedded systems, to generate hardware modules, and due to the software limitations, they produce the results for 350 × 170 pixel windows in their hardware–software implementation. However, for their pure hardware implementation, they process 1920 × 1080 images and achieve a higher frame rate of 39.6 fps.

Huang et al. [20] propose a hardware–software co-design of the algorithm by separating the classification and HOG computation parts. In their design, the HOG generation and computation is done on the FPGA, and the result is sent back to an ARM processor for the classification step. In order to improve classification, they use the Adaboost classifier first, followed by an SVM classifier to generate the final output. In the implementation by Ngo et al. [21], the classification step is done in software. They propose a sliding window architecture on hardware for the first part of the HOG algorithm.

Bilal et al. [22] propose a simplification of the HOG algorithm by introducing a histogram of significant gradients. In their proposed method, only the gradients whose value exceeds a threshold of the average gradient magnitude of a block cast a binary vote to the histogram. Therefore, there is no need for a normalization step. They use the HIK (histogram intersection kernel), which is a variation of the SVM, as the classification module, and implement it on a soft processor.

Existing hardware–software approaches have contributed significantly to the state-of-the-art, and research is ongoing to make further improvements. Some of the existing work, such as [18], requires multiple external memory accesses for intermediate results, which can increase the latency of the design. Another important observation in the existing work is that many designs include the processor in the flow of the data-path [6,19,20,21,23]. This can become the bottleneck of the system, since the processor is usually slower than the programmable logic and processes data sequentially. In [20,21,22], the classification step is assigned to the software side of the system. Since classification is part of the data flow and can start as soon as the first block is processed, assigning it to the hardware part is a superior option to increase the speed of the design. In this work, we propose a design which does not require any external memory access for computing a HOG descriptor for each window. We allocate the data-path of the algorithm to the programmable logic, and the control loops and address generation task to the processor. Therefore, the processor does not have a negative impact on the processing speed of the algorithm. In our design, we integrate feature extraction and classification in a unified pipeline to increase the speed of the process.

4. A Novel Hardware–Software Co-Design of the HOG-SVM System

In this section, we propose a hardware–software co-design system for HOG implementation. As a case study of the HOG algorithm's application, we choose human detection, which is an online application. The INRIA person dataset [15] is one of the more commonly used datasets for testing human detection approaches. In a real system, the input data would be captured using a digital image sensor and then converted to grayscale before being passed to the HOG feature extraction unit. For evaluation purposes, we use the image data from the INRIA dataset for training the SVM classifier and testing our implementation. We validate our design on a Xilinx® FPGA (Kintex® UltraScale) using Vivado®. Our contributions are made in two main ways. First are the algorithmic-level enhancements, which are the new ideas inside the HOG-SVM core, including the logarithm-based bin assignment, block normalization and parallel histogram computation. Second is at the task allocation level, which assigns the appropriate tasks to the processor system and the programmable logic of the design.

In a human detection system, a frame of an image is considered as the input. We use a sliding window technique, as in [5]. We use an 800 × 600 image resolution and a moving window size of 160 × 96 on the image. The frame size and the window size are based on the work by Luo et al. [11]; however, they could readily be changed for different applications. We extract the HOG features for all pixels and classify them using an SVM classifier. Since HOG feature extraction and classification are computationally expensive, we implement the HOG core in hardware in a fully pipelined manner. We allocate the image windowing step to software. This step is responsible for calculating the correct address of the image window in the memory and sending that address to the HOG core.
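As an illustration of this software step, the sketch below generates the byte address of each row of each sliding window, which the processor could hand to the DMA. The row-major layout, 8-bit pixels, the 96 × 160 window orientation and the function name are assumptions for illustration, not the actual firmware.

```python
def window_row_addresses(base_addr, img_w=800, img_h=600,
                         win_w=96, win_h=160, stride_x=None, stride_y=None,
                         bytes_per_pixel=1):
    """Yield (window_index, row_index, byte_address) for each row of each window."""
    stride_x = stride_x or win_w          # no horizontal overlap by default
    stride_y = stride_y or win_h          # no vertical overlap by default
    win = 0
    for y0 in range(0, img_h - win_h + 1, stride_y):
        for x0 in range(0, img_w - win_w + 1, stride_x):
            for row in range(win_h):
                # address the processor would hand to the DMA for this row transfer
                yield win, row, base_addr + ((y0 + row) * img_w + x0) * bytes_per_pixel
            win += 1
```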

The main parts of the proposed system are the MicroBlaze processor, a DMA (direct memory access) core and the HOG-SVM core. The MicroBlaze processor controls the main process by issuing the start signal to the HOG-SVM core and sending the address of an image to the DMA module. We assume that the input image is stored in BRAM (block RAM), which is the internal memory on the FPGA. This assumption is valid in multiple situations: there are many cases wherein other parts of a computer vision system acquire the image data and have loaded them beforehand into the BRAMs. In addition, since our main focus is on the architecture of the HOG core, this assumption does not affect the main concept. We read the data from the BRAM in a raster-scan streaming manner from the top left of the image to the bottom right. We divide each frame into several smaller windows, which can overlap with each other based on the required configuration. For each frame, the processor sends the address of the first pixel of the first row of a window to the DMA. The DMA, which is connected to the memory and the HOG core, reads one row of pixels from memory and sends that row to the HOG core in a streaming channel. The HOG core is designed using fixed-point numbers for efficiency. The core has two AMBA® AXI interface ports. AXI is part of the ARM® advanced microcontroller bus architecture, which provides a parallel high-performance interface. The first interface of the HOG core is based on the AXI-Lite protocol, which is used for communication between the processor and the HOG core. The second interface is an AXI-Stream port which is connected to the DMA for high-throughput data transfer. A simplified block diagram of the whole system is shown in Figure 4. We use the UART port as a matter of convenience to write the test image into the BRAM memory. Since the BRAM memory can be filled using various methods (depending on the application), this interface could be replaced with another connection interface without affecting the main concepts of this work.

Figure 4. Block diagram of the proposed design and port connections.

When the DMA is moving data from the memory to the HOG core, one pixel is sent to the core at each clock cycle. There is a finite state machine inside the HOG core to control data receiving and processing. When new data are received, the core processes those data, and in the off times, when the processor is sending the address of the next row (or the next window) to the DMA, the core enters a wait state. The whole system described in this section works with a maximum 150 MHz clock frequency. In the next section, we discuss the details of the HOG-SVM core.

5. HOG-SVM Core

The overall diagram of the fully pipelined HOG-SVM implementation is shown in Figure 5. The solid grey bars represent the registers of the pipeline which we add to reduce the delay of the critical paths. The initial time required for filling the pipeline and producing the first valid output is 4.25 × W + 14 clock cycles, where W is the width of the image window. This initial setup time includes 3 × W clock cycles in the deserializer module, eight clock cycles in the one-row histogram generator module, W clock cycles in the one-cell histogram buffers module, W/4 clock cycles in the two-row histogram buffers module, and six clock cycles for the separation registers shown as grey solid bars in Figure 5, which are added to reduce the critical timing path of the combinational logic. The gradient and magnitude module and the bin assignment module are combinational. The deserializer, one-row histogram generator, one-cell histogram buffers, and two-row histogram buffers all have internal registers and are fully pipelined at the pixel level. When data reach the last stage of the core, all modules work in parallel and there is no need to stop or delay the streaming input in this pipeline. After that, at each clock cycle, one valid SVM output is generated. In this section, we describe the implementation details of each part and the novel contributions.
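Summing these stage latencies confirms the stated total: 3W + 8 + W + W/4 + 6 = 4.25W + 14 clock cycles.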

Figure 5. The overall diagram of the HOG-SVM core.

5.1. Deserializer and Buffer Validity Check

The first module of the HOG core is the deserializer unit. This module contains three line buffers which have the depth of the full image window. At every clock cycle, one pixel of the image is read and entered into the first register of the first line buffer, and the values of the other registers are sent to the next adjacent registers. For the last register of the first row, the next register is the first register of the second row. Similarly, the value of the last register of the second row is sent to the first register of the third row. After reading three rows of the image, all three buffers are full, so we can compute the gradients in the horizontal and vertical directions.

Figure 6 shows the buffers in the deserializer module. The red registers at the end of the line buffers contain the output pixel values of this module, which are sent to the gradient module. The numbers in the first row show the sequence of pixels entering the module. This module requires a setup time of 3 × W clock cycles to fill all registers before producing valid outputs.

Figure 6. Line buffers in the deserializer module.

The buffer validity check module is a set of counters which observe the input stream from the deserializer and issue control-flow signals to enable the gradient and magnitude calculation module and the histogram generator module. These signals are important in order to synchronize the flow of valid data in the pipeline.

5.2. Gradient and Magnitude Calculation

After deserializing the input stream, the gradients in the horizontal and vertical directions are computed in the gradient and magnitude module. The gradients are computed using two subtraction units that take the horizontal and vertical differences of the neighbouring pixels, as in Equations (1) and (2). The magnitude of the gradient, which is approximated by the addition of the absolute values of the gradients in the horizontal and vertical directions, is obtained using two comparators and an adder unit. Since orientation computation and bin assignment are closely related to each other, we design a single unit for this step. The computed gradients are sent to that module for bin assignment.

As mentioned in [17], the original HOG algorithm requires 2 × W × H multiplication operations (for calculating the square of the gradients, twice for each pixel), W × H additions (once for each pixel), and W × H square root operations (once for each pixel) for calculating the magnitude of the gradients, where W is the width and H is the height of the image window. In our implementation, we simplify the magnitude computation to only W × H additions (for adding the absolute values, once for each pixel) and 2 × W × H inversion operations (for the absolute values of the gradients, twice per pixel).
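A minimal sketch of this simplified magnitude, assuming integer gradient inputs:

```python
def magnitude_approx(gx, gy):
    """Approximate gradient magnitude: |Gx| + |Gy| instead of sqrt(Gx^2 + Gy^2)."""
    return abs(gx) + abs(gy)   # two absolute values and one addition per pixel
```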

5.3. Logarithm-Based Bin Assignment

In this section, we introduce the new idea of logarithm-based bin assignment. The main advantage of this method is that there is no need to use multipliers, as in [10]. An embedded vision system could have multiple algorithms running simultaneously, and by not using multipliers we can save resources, such as DSP (digital signal processing) cores, for other parts of the system. The idea behind this design originates from the characteristic of the logarithm function, which can be used to transform division into subtraction. Equations (7)–(10) demonstrate the mathematical procedure of this method. Equation (7) presents the original orientation computation comparison. In the logarithm-based method, we first compute the tangent of all values, as in Equation (8). Then, we take the absolute value and then the base-2 logarithm of all values, as in Equation (9). We do not lose any information by computing the absolute value, since we store the sign bit for choosing the appropriate bin in the next step (we address this in detail later in this section). Afterwards, we separate the dividend and divisor of the gradients, as shown in Equation (10). We compute the log_2(|tan(θ_i)|) values offline, and only calculate the log_2(|G_x|) and log_2(|G_y|) values on the FPGA.

θ_i < tan^{−1}( G_y / G_x ) ≤ θ_{i+1}

(7)

tan(θ_i) < G_y / G_x ≤ tan(θ_{i+1})

(8)

log_2(|tan(θ_i)|) < log_2(|G_y / G_x|) ≤ log_2(|tan(θ_{i+1})|)

(9)

log_2(|tan(θ_i)|) < log_2(|G_y|) − log_2(|G_x|) ≤ log_2(|tan(θ_{i+1})|)

(10)

The reason that log_2 is chosen in this method is the larger slope that this function has in comparison with log_10 or log_e. Figure 7 shows the difference in slope among these functions. The greater the slope of the function, the more distinguishable the output values are. By using the function with the greater slope, the precomputed logarithm values can be scaled with a smaller ratio, thus minimizing quantization errors.

Figure 7. Difference between the slopes of three logarithm functions.

To compute the log_2 values, we employ an LUT-based (Look-Up Table) RAM. Depending on the input value, which is the computed gradient, we can select the appropriate log_2 value, which is stored in the LUTs of the FPGA.

After retrieving the log_2 values, we subtract log_2(|G_y|) and log_2(|G_x|) from each other, and we find the appropriate bin based on the subtraction result. To keep the design accurate while respecting the hardware resource restrictions, we scale the values so that we can avoid fractional (mantissa) numbers. Equation (11) shows the scaled version of (10). Both outer sides of the inequality are precomputed, and for the middle expression, an addition of 160 is applied. The reason for adding 160 is that we multiply all sides of (8) by 32 and then compute the log_2 of them; then, we multiply the logarithm values by 32. Since 32 log_2(32) is equal to 160, we add 160 to the middle expression. The LUT-based log_2 therefore calculates 32 log_2 instead of log_2, and only the absolute values of G_x and G_y are given to these LUTs as input.

32 log_2( 32 |tan(θ_i)| ) < 160 + 32 log_2(|G_y|) − 32 log_2(|G_x|) ≤ 32 log_2( 32 |tan(θ_{i+1})| )

(11)
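Expanding the middle term makes the origin of the constant explicit: 32 log_2( 32 |G_y| / |G_x| ) = 32 log_2(32) + 32 log_2(|G_y|) − 32 log_2(|G_x|) = 160 + 32 log_2(|G_y|) − 32 log_2(|G_x|).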

After that, the appropriate bin is selected using the sign bit, as in Figure 8. In this figure, L1 to L5 represent the precomputed limits for deciding the appropriate bin. After the range of the number is determined, the appropriate bin is selected according to the sign value. In Figure 8, V is the term computed by the subtraction of the logarithm values. Depending on the sign bit, a range of the orientations is chosen.

Figure 8. The bin assignment process.

The limit values in Figure 8 are shown in Table 1. These limits are the precomputed values of 32 log_2( 32 tan(θ_i) ), where θ_i is a bin limit between −90 and 90 degrees.

Table 1

Limits for bin assignment.

Limits Value
L1 80
L2 135
L3 168
L4 207
L5 347

The pseudo-code for the bin assignment step is shown in Algorithm 1.

Algorithm 1 The pseudo-code for the bin assignment step
  1. Calculate the absolute values of G_y and G_x
  2. Store the sign of (G_y) × (G_x) in the sign bit
  3. Calculate the scaled logarithms of G_y and G_x
  4. Based on the log value, map to the −90 to 0 degree bins if the sign bit is negative
  5. Based on the log value, map to the 0 to +90 degree bins if the sign bit is positive
It is important to note that for this part, each bin is represented with 12 bits to maintain the accuracy. As mentioned in [17], in the original HOG algorithm, the orientation and bin assignment module requires W × H arctangent operations, W × H divisions, and 9 × W × H comparison operations. Although some previous works [10] use 18 × W × H multiplication operations instead of the arctangent, by using our method, the bin assignment module does not use any multipliers. It computes the appropriate bin using only W × H subtractions (for the log_2(|G_y|) and log_2(|G_x|) subtraction), 9 × W × H comparisons (for bin assignment), 2 × W × H inversions (for the absolute values of the gradients), and reads from LUTs. As a result, multipliers and DSP units are saved for other possible processes required in the vision system.
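As a behavioural illustration of Algorithm 1 and the limits of Table 1, the Python sketch below models the multiplier-free bin assignment. The lookup-table depth (8-bit gradient magnitudes), the treatment of zero gradients and the final bin numbering are assumptions made for this sketch, not the exact mapping of Figure 8.

```python
import math

# Scaled-logarithm lookup table, as stored in FPGA LUTs: 32*log2(v) for 8-bit gradients
LOG2_LUT = [0] + [round(32 * math.log2(v)) for v in range(1, 256)]

# L1..L4 from Table 1: 32*log2(32*tan(theta)) at theta = 10, 30, 50, 70 degrees.
# L5 (347) covers near-vertical gradients in hardware; the sketch checks Gx = 0 directly.
LIMITS = [80, 135, 168, 207]

def assign_bin(gx, gy):
    """Multiplier-free bin assignment: 9 bins with edges at +/-10, 30, 50, 70 degrees.
    Bin numbering (0 = [-90,-70) ... 8 = [70,90]) is illustrative, not the exact hardware map."""
    if gy == 0:
        return 4                          # horizontal gradient -> centre bin [-10, 10)
    if gx == 0:
        region = 4                        # vertical gradient -> outermost band
    else:
        # Equation (11): 160 + 32*log2|Gy| - 32*log2|Gx|, compared against the limits
        v = 160 + LOG2_LUT[abs(gy)] - LOG2_LUT[abs(gx)]
        region = sum(v >= limit for limit in LIMITS)      # 0..4 = |angle| band
    sign_negative = (gx < 0) != (gy < 0)  # sign bit of Gy*Gx (Algorithm 1, step 2)
    return 4 - region if sign_negative else 4 + region
```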

5.4. One-Row Histogram Generator

We describe the implementation of the one-row histogram generator unit in this section. This module gets the magnitude and bin assignment inputs from the previous modules. Then, according to the orientation related to each magnitude, a histogram is created for every eight pixels. Calculating the histogram requires more than eight clock cycles. This module contains nine registers, one for each bin. In the first eight clock cycles, the input enters this module, and each magnitude is added to the appropriate register, representing an orientation bin. The module then requires one clock cycle to output the completed partial histogram, and one clock cycle to reset the registers to zero again to get ready for the next incoming pixels. Since computing histograms in this way would require the input data stream to pause, we design this step using two partial histogram generators, which work in parallel using a time-sharing protocol. As illustrated in Figure 9, the input divider sends a valid magnitude and bin number to the compute-histogram modules, and the multiplexer at the end chooses the valid histogram based on the time-sharing protocol. Figure 10 demonstrates how the time-sharing protocol works for each 8 pixels entering the one-row histogram generator module. Each module requires eight clock cycles to create the histogram, one clock cycle to put it on the output port and one clock cycle to reset the registers. At the ninth clock cycle the output is valid, and at the tenth clock cycle, we reset the registers. While one of the compute-histogram modules is in the output and reset phase, the other one receives the input stream of data and continues the process. Therefore, there is no need to stop the streaming input. Otherwise, we would have to pause the streaming input for each cell calculation, which could slow a design, especially when processing high-resolution frames.
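A behavioural Python sketch of this time-sharing scheme, in which two accumulators alternate every eight pixels so that one can output and reset while the other keeps accepting the stream (bin counts and bit widths are simplified relative to the RTL):

```python
def one_row_histograms(stream, bins=9, group=8):
    """stream: iterable of (magnitude, bin_index) pairs, one per clock cycle.
    Yields a 9-bin partial histogram for every group of 8 consecutive pixels."""
    accum = [[0] * bins, [0] * bins]      # two ping-pong accumulators
    active = 0                            # which accumulator receives the stream
    count = 0
    for mag, b in stream:
        accum[active][b] += mag           # add the magnitude to the selected bin
        count += 1
        if count == group:
            yield list(accum[active])     # output the completed partial histogram
            accum[active] = [0] * bins    # reset while the other accumulator takes over
            active ^= 1                   # switch accumulators (time-sharing protocol)
            count = 0
```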

Figure 9. One-row histogram generation module.

Figure 10. The time-sharing protocol for the histogram generation module.

5.5. One-Cell Histogram Buffers

We use the same architecture proposed by Luo et al. [11] for designing the one-cell histogram buffers and the two-row histogram buffers. The one-cell histogram buffers module computes the histogram of each 8 × 8 cell. Its input, at every clock cycle, is the nine 16-bit bins of the histogram of eight pixels in a row. Since our goal is to compute the histogram of 8 × 8 cells, we use histogram buffers to store the computed histograms sent from the one-row histogram generator module.

This module contains two parts. The first part has eight lines of buffers, and each line has eight buffers. At each clock cycle, a histogram of eight pixels enters the first line buffer of this module. Then, the line buffers work as a shift register, and at each clock cycle, the values are moved through the line buffers. When the first entry of the line buffers reaches the last register, the data in the last register of each line are the histograms of eight pixels of each row of a cell. Therefore, by adding them together bin by bin, we can derive the histogram of a cell. On the next clock cycle, the histogram of the next cell is computed. This process continues until the cell line in the image changes. While the line buffers are loading up again, their output is not valid.

Figure 11 illustrates the eight line buffers of this module. We use a tree-based adding structure to minimize the critical path of the combinational logic for addition. Since we have eight operands from eight buffers, the tree-based adding structure has three levels. Therefore, by using a three-level tree-based adding structure, the histogram of a cell can be computed efficiently. This module requires W/8 clock cycles to fill the first row of the buffers, since the input of this module is a histogram computed for eight pixels. Since there are eight rows in this module, a total of W clock cycles is required to fill the buffers of this module and generate the first valid output.
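A behavioural sketch of the three-level adder tree that combines the eight buffered per-row histograms into one cell histogram, bin by bin (a software model of the structure in Figure 11, not the RTL):

```python
def adder_tree_sum(row_hists):
    """Sum eight 9-bin partial histograms with a 3-level balanced adder tree."""
    level = [list(h) for h in row_hists]          # 8 partial histograms
    while len(level) > 1:                         # 8 -> 4 -> 2 -> 1: three levels
        level = [[a + b for a, b in zip(level[i], level[i + 1])]
                 for i in range(0, len(level), 2)]
    return level[0]                               # the 9-bin cell histogram
```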

Figure 11. The block diagram of the one-cell histogram buffers.

5.6. Two-Row Histogram Buffers

The next stage is the two-row histogram buffers. The objective of this module is to deserialize the computed cells in order to have access to four adjacent cells in parallel. Figure 12 shows the block diagram of this module. At each clock cycle, if the input is valid, a nine-bin histogram enters these line buffers. When the first cell which entered this module reaches the last register, we have the histograms of the cells of two cell rows ready at the same time. These values are the output of this module. This module requires a setup time of 2 × W/8 clock cycles to generate the first valid output, since there are two rows and W/8 registers in each row.

Figure 12. The block diagram of the two-row histogram buffers.

5.7. Block Normalization

The next stage of this design is block normalization. For an exact implementation with a latency of one clock cycle for the normalization, 36 multipliers, one square root operation and one division are required. The other possible design is to use one multiplier and compute the square operation once in each clock cycle, which adds 36 clock cycles to the normalization of each block. In this work, we propose a simplified design for the block normalization step. Our design normalizes each histogram bin so that the summation of all bins in a block is less than a specific threshold. Choosing a larger value for this threshold results in less approximation and therefore more accuracy; however, it consumes more hardware resources, since we must dedicate more bits to the result. We choose 255 for this limit as a trade-off between accuracy and resource usage. In addition, we use division by powers of 2, which simply shifts the input value and is much less resource-consuming than other division algorithms.

Figure 13 shows the block diagram of the normalization module. The block normalization module receives two histograms from the two cells in one cell column at each clock cycle and stores them in the top-left and bottom-left registers. Since the normalization is done for every four cells, this module also holds the two inputs from the previous clock cycle in the top-right and bottom-right registers. In the subsequent clock cycle, when all four histograms are ready, the block normalization module computes the normalized value. First, all bins of the four histograms are added to each other using a tree-based adding structure. Then, depending on the four most significant bits of the sum value, a step-based normalization is selected using a decoder. Finally, each histogram is shifted using a barrel shifter.

Figure 13. The block diagram of the block normalization module.

We aim to limit the summation of each block to 255, as shown in Table 2. Therefore, depending on the location of the most significant bit that is a '1' in the sum value, all histogram values are divided. If the value of the summation is more than 2047, we divide all histograms by 16. If it is less than that, depending on the bit number, we shift the histograms to the right (each shift divides by two) in order to keep the summation in the range of 0 to 255.

Table 2

Block normalization decoding method.

Limits of Values for Sum   Bits of the Summation   Division of All Histogram Bins   Number of Bits to Shift the Histograms
Sum > 2047                 Bit 11 is checked       Histogram/16                     Histogram >> 4
2048 > Sum > 1023          Bit 10 is checked       Histogram/8                      Histogram >> 3
1024 > Sum > 511           Bit 9 is checked        Histogram/4                      Histogram >> 2
512 > Sum > 255            Bit 8 is checked        Histogram/2                      Histogram >> 1
Sum < 256                  ---                     Histogram                        Histogram

We benefit from checking a single bit of the summation by using binary values for the comparison and division. We also perform the division by shifting the histogram values, and therefore avoid a complex divider circuit. As mentioned in [17], in the original HOG algorithm the block normalization step requires 9 × C multiplication (for the square of each histogram bin), addition and division operations, and B square root operations, where C is the total number of cells and B is the total number of blocks in an image window. Our simplification results in 35 × B addition operations (for the adder tree) and 36 × B shifting operations (for the four cells in each block).
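A Python sketch of this simplified normalization, following the decoding of Table 2: the 36 bins of a block are summed, the sum selects a shift amount, and every bin is divided by the corresponding power of two (bit widths are simplified relative to the RTL):

```python
def normalize_block(cell_hists):
    """cell_hists: four 9-bin integer histograms of one block.
    Returns the shifted (normalized) histograms per the decoding of Table 2."""
    total = sum(sum(h) for h in cell_hists)        # adder-tree sum of all 36 bins
    if total > 2047:
        shift = 4                                  # divide by 16
    elif total > 1023:
        shift = 3                                  # divide by 8
    elif total > 511:
        shift = 2                                  # divide by 4
    elif total > 255:
        shift = 1                                  # divide by 2
    else:
        shift = 0                                  # already within the target range
    return [[bin_value >> shift for bin_value in h] for h in cell_hists]
```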

5.8. SVM Classifier

The final part of the HOG-SVM core is the SVM classifier. In this stage, the output of the block normalization step is given as the input. Since four histograms are normalized at each clock cycle, the SVM module gets four nine-bin histograms as input at once. These histograms are given to the four SVM blocks in this module, as shown in Figure 14.

Figure 14. The SVM classifier module.

Each of the four parallel SVM blocks contains an SVM RAM, which holds the precomputed weights of the SVM classifier. In each SVM block, the input histogram bins are multiplied bin by bin by the trained weights of the SVM classifier, and the results are added together in four accumulators. The internal logic of an SVM block is illustrated in Figure 15. This unit contains an SVM RAM which holds the precomputed weights, and nine multipliers work in parallel in each SVM block module. Finally, when all the data are processed, the values of the accumulators and the bias term of the SVM classifier are added, which gives the final score of the SVM classifier. By comparing this score with a predefined threshold, the SVM indicates whether the image window is a positive or negative sample. If the score is more than the threshold, the label is one; otherwise, it is zero. In terms of the number of operations, the SVM classifier module requires B comparisons, 36 × B multiplications and 40 × B addition operations, where B is the total number of blocks in an image window.
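A behavioural sketch of this streaming classification, assuming one precomputed weight block per normalized histogram block and the same four-accumulator layout; the weight ordering and function names are assumptions for illustration:

```python
def svm_stream_classify(block_stream, weights, bias, threshold=0.0):
    """block_stream: iterable of blocks, each a list of four 9-bin histograms.
    weights: matching list, one [4][9] weight block per input block (assumed layout)."""
    acc = [0.0, 0.0, 0.0, 0.0]                     # four parallel accumulators
    for block, w_block in zip(block_stream, weights):
        for k in range(4):                         # the four parallel SVM blocks
            acc[k] += sum(h * w for h, w in zip(block[k], w_block[k]))
    score = sum(acc) + bias                        # combine accumulators and bias term
    return 1 if score > threshold else 0
```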

Figure 15. SVM block internal logic.

6. Results and Comparison with Other Work

Rettkowski et al. [19] were among the first to demonstrate the speed gain of a pure hardware implementation of the HOG algorithm over a software implementation. A pure hardware implementation consumes more resources than when some part of the computation is done on the processor. However, computational approximations in hardware implementations can lead to some accuracy loss. A hardware–software co-design provides a trade-off between preserving the accuracy and limiting hardware resource usage. Therefore, such a design should be compared to other hardware–software designs which face the same trade-off in order to have a fair comparison.

Unlike most previous work [6,19,20,21,22,23], in our design the flow of the data does not include the processor itself. This is important since in those cases, the processor would be the bottleneck of the system. In addition, the HOG-SVM core is designed so that no memory access is required for intermediate computations, unlike in [18]. Intermediate communications with external off-chip memory reduce the speed of the system; in our design, everything is buffered using on-chip FPGA resources. Another advantage of our design is that when the first block of the normalized histograms is ready, the classification step starts, and at each clock cycle, one block of data is given to the classifier. Classification is part of the data flow, and assigning it to software as in [22] could decrease performance. At the algorithmic level, we make three contributions. By using the logarithm-based bin assignment, we save four multipliers (DSP units), which can be used for other possible computations or applications on the same chip. By using a simplified block normalizer, we save 36 multipliers, one division unit and one square root operation. In addition, by employing parallel histogram computation, we save 20% of the time for each histogram's generation. We provide the results of our implementation and the comparison with other work in Table 3. The numbers provided for other work in this table are obtained from their published results.

Table 3

Comparison with other work.

Reference                 FPGA                Image Size    LUTs      BRAM (Kbit)   DSP   Frame Rate (fps)   Pixel per Clock Cycle

Pure hardware design:
Rettkowski et al. [19] 1  Zynq®               1920 × 1080   41,858    1584          13    39.6               0.99
Ngo et al. [14]           Cyclone® V          640 × 480     13,646    317           38    75                 0.46
Long et al. [13]          Stratix® IV         512 × 512     266,023   47            236   2500               8.19
Luo et al. [11]           Cyclone® IV         800 × 600     16,060    334           69    162                0.51
Qasaimeh et al. [12]      Zynq®               1920 × 1080   32,871    NA            130   48                 0.59

Hardware–software co-design:
Mizuno et al. [18]        Cyclone® IV         800 × 600     34,403    334           68    72                 0.86
Ma et al. [6]             Virtex®-6           640 × 480     184,953   13,737        190   68                 0.14
Bilal et al. [22]         Cyclone® IV         640 × 480     65,501    103           10    25                 0.15
Yu et al. [23]            Spartan®-6          640 × 480     15,167    351           19    1.5                NA
Rettkowski et al. [19] 1  Zynq®               350 × 175     NA        NA            NA    0.44               0.0001
Ngo et al. [21]           Cyclone® V          640 × 480     12,138    437           65    11                 0.02
Huang et al. [20]         Spartan®-6          384 × 288     NA        NA            NA    25                 NA

Our HW-SW co-design       Kintex® UltraScale  800 × 600     7804      756           36    115                0.37

The last column of Table 3 reports the metric of pixels per clock cycle. Our proposed design has a larger pixel-per-clock-cycle value than most of the other hardware–software methods. The work by Long et al. [13] has the highest pixel-per-clock-cycle value; the reason is that in [13], the input of the system is 64 pixels per clock cycle, while the others receive one pixel per clock cycle as input. The work by Mizuno et al. [18], which achieves the highest pixel-per-clock-cycle value among the hardware–software co-design work (due to their highly parallel architecture), uses about twice the number of DSPs and about four times more LUT resources than our proposed design for the same image resolution. In that sense, our design is more efficient in terms of resource usage and, after the initial setup time, can produce a valid output at each clock cycle. Pure hardware implementations are typically faster than hardware–software implementations; however, they often require more hardware resources as a trade-off.

As shown in Table 3, Rettkowski et al. [19] use a Zynq® family FPGA, while Ma et al. [6] use a Virtex® series FPGA. Mizuno et al. [18], Bilal et al. [22] and Ngo et al. [21] use Cyclone® family FPGAs. Cyclone® V devices have more available memory, while Virtex® family FPGAs have more logic elements than the Cyclone® and Zynq® series. The latest FPGAs and technologies should lead to faster systems; however, innovative implementation is also a big driving factor in making an effective and efficient system. Ma et al. [6] implement HOG in 34 scales but use the FPGA resources more extensively than other work.

The results in Table 3 indicate that our system uses a comparable number of DSPs and BRAMs when processing images of similar size, and fewer LUT resources than other work implementing hardware–software co-design systems and pure hardware systems. The frame rate mentioned in Table 3 is for the case in which there is no overlap between sliding windows. If we increase the number of overlapped pixels (or decrease the pixel stride), the frame rate decreases. Stride is the number of pixels between the current window and the next window in one direction. Figure 16 demonstrates the relationship between frame rate and pixel stride on a logarithmic scale. This figure shows that there is a nearly linear relationship between frame rate and pixel stride.

Figure 16. The relation between pixel stride and frame rate.

Since the sliding window part of the system only has the responsibility of calculating the correct address of the windows, it is reasonable to choose the processor for this task. On the other hand, the HOG and SVM calculations, which require many additions, multiplications and comparisons, are more efficient in hardware. Our design is well-suited for applications, such as mobile and embedded systems, where there is a limitation on hardware resources. By minimizing the hardware resource usage of HOG and SVM, more resources are available for other parts of an application, and we can still get accurate and comparable results. Table 4 shows the resource usage of all parts of the HOG-SVM IP-core.

Table 4

HOG-SVM IP-core resources.

Module Name                   LUTs   Block RAM Tile   DSP
De-serializer                 117    0                0
Buffer validity check         62     0                0
Gradient and Magnitude        8      0                0
One-row histogram generator   502    0                0
One-cell histogram buffers    1960   0                0
Two-row histogram buffers     376    0                0
Block normalizer              799    0                0
SVM classifier                1622   0                36
Overall                       5658   0                36
Percentage used 1             1.06%  0                1.87%

We present the resource usage of the whole system in Table 5. The reset and clock module is responsible for creating the required clock frequencies and distributing the clock and reset signals to all parts of the design. MicroBlaze is the main processor, which contains local memory, a debug module, a peripheral controller and an interrupt controller. We use an AXI Data FIFO to buffer the streaming data from the DMA module to the HOG-SVM IP-core.

Table 5

Resource usage of the whole hardware–software system.

Module Name                        LUTs   Block RAM Tile (36 Kbit)   DSP
Reset and Clock                    17     0                          0
MicroBlaze                         1114   0                          0
MicroBlaze local memory            11     16                         0
MicroBlaze Debug Module            156    0                          0
MicroBlaze Peripheral Controller   179    0                          0
MicroBlaze Interrupt Controller    69     0                          0
AXI Data FIFO                      56     0.5                        0
DMA                                544    4.5                        0
HOG-SVM core                       5658   0                          36
Sum                                7804   21                         36
Percentage used 1                  1.47%  3.5%                       1.87%

To measure the speed of the design, we load the input image into the BRAM memory of the FPGA. In our experiments we use 800 × 600 images so as to be comparable with other hardware–software co-design work, since published results are generally at this resolution. However, using a higher image resolution such as 1920 × 1080 does not affect our implementation in terms of resource usage, since the required resources are determined by the image window size and not the whole image. The processor starts the computation by instructing the DMA to read from the memory and send the data to the HOG-SVM IP-core. Since in a practical application an external memory can be used and the image can have any arbitrary size, we did not report the number of BRAM memories dedicated to the image stored on the FPGA in Table 5, as it is not one of the principal elements of the proposed system.

The bandwidth of the designed streaming channel between the memory and the HOG-SVM IP-core is 1.2 Gbit/s, since the DMA can send each pixel in one clock cycle to the HOG core. In our design, for each line of the image, the processor sends a command to the DMA to start the data transfer for a specific number of pixels. Although this controlling mechanism gives the system the capability to process different image sizes, it adds an overhead to the timing. Therefore, the data rate of the transfer between the memory and the HOG-SVM IP-core is decreased to 55 Mbit/s, based on our measurements.
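Assuming 8-bit grayscale pixels, one pixel per clock cycle at the 150 MHz system clock corresponds to 150 MHz × 8 bit = 1.2 Gbit/s, which matches the stated peak channel bandwidth.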

In terms of the number of operations, as discussed in detail in Section 5, our proposed design has reduced the W × H + 9 × C additions, 2 × W × H + 9 × C multiplications, W × H arctangent operations, W × H + 9 × C divisions and W × H + B square root operations of the original HOG algorithm to 2 × W × H + 35 × B additions, 4 × W × H inversions, 9 × W × H comparisons and 36 × B shifting operations, where C is the total number of cells in an image window. These numbers exclude the parts which are the same, such as the operations required by the SVM module. The SVM module requires B comparisons, 36 × B multiplications and 40 × B addition operations, where B is the total number of blocks in an image window.

We used a hardware model in MATLAB® for evaluating the accuracy of the design. The hardware model produces identical results to the actual implementation on the FPGA. This procedure is similar to the work by Luo et al. [11]. The accuracy results of our system are shown in Figure 17. It can be observed that for the test set of the INRIA dataset, the accuracy of our design is very close to, but slightly lower than, that of the software implementation of the algorithm, which is due to the quantization of the floating-point values and the simplifications in hardware. Figure 17 shows the miss rate versus false positives per window, which is the most common method for evaluating human detection systems. The vertical axis shows the miss rate and the horizontal axis represents the number of false positives per window. This diagram is typically drawn on a log–log scale. The software version is our implementation of the HOG-SVM, using MATLAB® software and the Statistics and Machine Learning Toolbox, based on [5].

Figure 17. Comparison of the software implementation and our proposed HW-SW co-design.

7. Conclusions

In this work, we proposed a hardware–software co-design of the HOG algorithm, which can receive input data from a digital image sensor, extract the HOG features and make a decision based on those features. Our implementation makes four main contributions. First, at the task allocation level, we suggest a well-organized partition between the different parts of a hardware–software co-design system, which consumes fewer FPGA resources than other comparable hardware–software systems. The idea is to assign the computationally intensive parts of the algorithm, such as gradient and magnitude computation, bin assignment, normalization and classification, to hardware, and to delegate the resource-intensive part, which is the windowing stage, to software. Second, as an algorithmic-level contribution, to the best of our knowledge, we are the first to suggest a logarithm-based bin assignment in the HOG algorithm, which leads to a multiplier-free implementation of the HOG and reduces the overall number of multipliers for the HOG-SVM core. Third, we propose the use of two parallel histogram computation modules, which save one clock cycle for every 8 pixels. As a result, the HOG core can accept the pixel data in a streaming manner on each clock cycle without any interruption. Finally, we propose a simpler implementation of the block normalization step, which reduces the IP-core resources.
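To give a flavour of the second contribution, the sketch below shows one generic way a log-domain comparison can replace the tangent multiplications in comparison-based bin assignment. It uses the usual nine 20° orientation bins and floating-point log2f as a stand-in for the small lookup tables a hardware design would use, and it is an illustration of the general idea rather than the exact scheme implemented in our core.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Generic log-domain bin assignment: instead of testing
 *     |gy| < tan(theta_k) * |gx|
 * at each 20-degree boundary (which needs multiplications), the test becomes
 *     log2|gy| - log2|gx| < log2(tan(theta_k)),
 * i.e. a subtraction followed by comparisons against constants. */
static int hog_bin_log_domain(int gx, int gy)
{
    /* log2(tan(theta)) at the boundaries 20, 40, 60, 80 degrees */
    static const float bound[4] = { -1.4580f, -0.2531f, 0.7925f, 2.5037f };

    if (gy == 0) return 0;                  /* horizontal gradient -> bin 0 */
    if (gx == 0) return 4;                  /* vertical gradient   -> bin 4 */

    float d = log2f((float)abs(gy)) - log2f((float)abs(gx));
    int k = 0;
    while (k < 4 && d >= bound[k])          /* sector of atan(|gy|/|gx|)    */
        k++;
    /* Same-sign gradients give an angle in [0, 90); opposite signs mirror
     * the sector into (90, 180). */
    return ((gx > 0) == (gy > 0)) ? k : 8 - k;
}

int main(void)
{
    printf("bin(3, 1)  = %d\n", hog_bin_log_domain(3, 1));   /* ~18 deg  -> bin 0 */
    printf("bin(-1, 2) = %d\n", hog_bin_log_domain(-1, 2));  /* ~117 deg -> bin 5 */
    return 0;
}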

Our design has the capability to use several HOG-SVM IP-cores in parallel on one image. In future work, we can modify the design to take advantage of this feature and increase the speed of the system. Another possibility is to use interrupts efficiently to read precomputed window addresses from the memory. In this way, the processor would have more free time to perform other tasks while the HOG-SVM cores and DMAs are processing the image. Another possible enhancement involves developing other variants of the HOG algorithm and their implementation in hardware. There are many other variants of the HOG algorithm, such as HOG-3D [24], which require a high number of computations and can benefit from parallel implementation.

Author Contributions

Methodology, S.G. and P.S.; software, S.G. and P.S.; validation, S.G. and P.S.; writing—original draft preparation, S.G. and P.S.; writing—review and editing, S.G., P.S., K.F.L. and D.W.C.; supervision, K.F.L. and D.W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Doctoral Fellowships from the University of Victoria, and by Discovery Grants #36401 and #04787 from the Natural Sciences and Engineering Research Council of Canada.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Priyanka, Kumar D. Feature Extraction and Selection of Kidney Ultrasound Images Using GLCM and PCA. Procedia Comput. Sci. 2020;167:1722–1731. doi: 10.1016/j.procs.2020.03.382. [CrossRef] [Google Scholar]

2. Djeziri M., Benmoussa S., Zio E. Artificial Intelligence Techniques for a Scalable Energy Transition. Springer; Berlin/Heidelberg, Germany: 2020. Review on Health Indices Extraction and Trend Modeling for Remaining Useful Life Estimation; pp. 183–223. [Google Scholar]

3. Lowe D. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004;60:91–110. doi: 10.1023/B:VISI.0000029664.99615.94. [CrossRef] [Google Scholar]

4. Bay H., Tuytelaars T., Van Gool L. Computer Vision—ECCV 2006. Springer; Berlin/Heidelberg, Germany: 2006. SURF: Speeded Up Robust Features; pp. 404–417. [Google Scholar]

5. Dalal N., Triggs B. Histograms of Oriented Gradients for Human Detection; Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05); San Diego, CA, USA. 20–25 June 2005. [Google Scholar]

6. Ma X., Najjar W., Roy-Chowdhury A. Evaluation and Acceleration of High-Throughput Fixed-Point Object Detection on FPGAs. IEEE Trans. Circuits Syst. Video Technol. 2015;25:1051–1062. [Google Scholar]

7. Montalvo V., Estévez-Bén A., Rodríguez-Reséndiz J., Macias-Bobadilla G., Mendiola-Santíbañez J., Camarillo-Gómez K. FPGA-Based Architecture for Sensing Power Consumption on Parabolic and Trapezoidal Motion Profiles. Electronics. 2020;9:1301. doi: 10.3390/electronics9081301. [CrossRef] [Google Scholar]

8. Zhang X., Wei X., Sang Q., Chen H., Xie Y. An Efficient FPGA-Based Implementation for Quantized Remote Sensing Image Scene Classification Network. Electronics. 2020;9:1344. doi: 10.3390/electronics9091344. [CrossRef] [Google Scholar]

9. Zhao G., Hu C., Wei F., Wang K., Wang C., Jiang Y. Real-Time Underwater Image Recognition with FPGA Embedded System for Convolutional Neural Network. Sensors. 2019;19:350. doi: 10.3390/s19020350. [PMC free article] [PubMed] [CrossRef] [Google Scholar]

10. Blair C., Robertson N. Video Anomaly Detection in Real Time on a Power-Aware Heterogeneous Platform. IEEE Trans. Circuits Syst. Video Technol. 2016;26:2109–2122. doi: 10.1109/TCSVT.2015.2492838. [CrossRef] [Google Scholar]

11. Luo J., Lin C. Pure FPGA Implementation of an HOG Based Real-Time Pedestrian Detection System. Sensors. 2018;18:1174. doi: 10.3390/s18041174. [PMC free article] [PubMed] [CrossRef] [Google Scholar]

12. Qasaimeh M., Zambreno J., Jones P. A Runtime Configurable Hardware Architecture for Computing Histogram-Based Feature Descriptors; Proceedings of the 2018 28th International Conference on Field Programmable Logic and Applications (FPL); Dublin, Ireland. 27–31 August 2018. [Google Scholar]

13. Long X., Hu S., Hu Y., Gu Q., Ishii I. An FPGA-Based Ultra-High-Speed Object Detection Algorithm with Multi-Frame Information Fusion. Sensors. 2019;19:3707. doi: 10.3390/s19173707. [PMC free article] [PubMed] [CrossRef] [Google Scholar]

14. Ngo V., Castells-Rufas D., Casadevall A., Codina M., Carrabina J. Low-Power Pedestrian Detection System on FPGA. Proceedings. 2019;31:35. doi: 10.3390/proceedings2019031035. [CrossRef] [Google Scholar]

16. Ghaffari S., Soleimani P., Li K., Capson D. FPGA-Based Implementation of HOG Algorithm: Techniques and Challenges; Proceedings of the 2019 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM); Victoria, BC, Canada. 21–23 August 2019. [Google Scholar]

17. Ghaffari S., Soleimani P., Li K., Capson D. Analysis and Comparison of FPGA-Based Histogram of Oriented Gradients Implementations. IEEE Access. 2020;8:79920–79934. doi: 10.1109/ACCESS.2020.2989267. [CrossRef] [Google Scholar]

18. Mizuno K., Terachi Y., Takagi K., Izumi S., Kawaguchi H., Yoshimoto M. Architectural Study of HOG Feature Extraction Processor for Real-Time Object Detection; Proceedings of the 2012 IEEE Workshop on Signal Processing Systems; Quebec City, QC, Canada. 17–19 October 2012. [Google Scholar]

19. Rettkowski J., Boutros A., Göhringer D. HW/SW Co-Design of the HOG Algorithm on a Xilinx Zynq SoC. J. Parallel Distrib. Comput. 2017;109:50–62. doi: 10.1016/j.jpdc.2017.05.005. [CrossRef] [Google Scholar]

20. Huang S., Lin S., Hsiao P. An FPGA-Based HOG Accelerator with HW/SW Co-Design for Human Detection and Its Application to Crowd Density Estimation. J. Softw. Eng. Appl. 2019;12:1. doi: 10.4236/jsea.2019.121001. [CrossRef] [Google Scholar]

21. Ngo V., Casadevall A., Codina M., Castells-Rufas D., Carrabina J. A High-Performance HOG Extractor on FPGA. [(accessed on 23 August 2020)]; Available online: https://arxiv.org/abs/1802.02187.

22. Bilal M., Khan A., Khan M., Kyung C. A Low-Complexity Pedestrian Detection Framework for Smart Video Surveillance Systems. IEEE Trans. Circuits Syst. Video Technol. 2017;27:2260–2273. doi: 10.1109/TCSVT.2016.2581660. [CrossRef] [Google Scholar]

23. Yu Z., Yang S., Sillitoe I., Buckley K. Towards a Scalable Hardware/Software Co-Design Platform for Real-Time Pedestrian Tracking Based on a ZYNQ-7000 Device; Proceedings of the 2017 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia); Bangalore, India. 5–7 October 2017. [Google Scholar]

24. Dupre R., Argyriou V. 3D Voxel HOG and Risk Estimation; Proceedings of the 2015 IEEE International Conference on Digital Signal Processing (DSP); Singapore. 21–24 July 2015; New York, NY, USA: IEEE; 2015. [Google Scholar]
