^{1}

^{1}

^{*}

^{1}

In our previous work, a novel algorithm to perform robust pose estimation was presented. The pose was estimated using points on the object to regions on image correspondence. The laboratory experiments conducted in the previous work showed that the accuracy of the estimated pose was over 99% for position and 84% for orientation estimations respectively. However, for larger objects, the algorithm requires a high number of points to achieve the same accuracy. The requirement of higher number of points makes the algorithm, computationally intensive resulting in the algorithm infeasible for real-time computer vision applications. In this paper, the algorithm is parallelized to run on NVIDIA GPUs. The results indicate that even for objects having more than 2000 points, the algorithm can estimate the pose in real time for each frame of high-resolution videos.

In our previous work, a novel pose estimation algorithm based on points to region correspondence was proposed in [

The pose estimation algorithm, described in [

In this paper, we propose a GPGPU parallelization approach for the pose estimation algorithm. The goal is to achieve optimal and scalable implementation of the pose estimation algorithm on GPUs, making it suitable for real-time applications. To achieve our goal, we parallelize the algorithm in two phases-design and implementation. In design phase, we analyze the algorithm, determine the bottleneck, re-factor the algorithm and data to enhance parallelism. For implementation phase, we follow the standard guidelines for optimization on GPU prescribed by NVIDIA and GPU experts, and fine tune GPU architecture parameters such as threading, blocks, streams etc. to obtain the near optimal implementation. Our method of parallelizing the algorithm first via design, and then via implementation, makes our effort of parallelization different from other works.

The original work presented in [

The focus of our research work is to parallelize the algorithm in [

The purpose of our work is to extend the work in [

This paper is organized as follows: Section 2 describes an implementation of the algorithm provided in [

The implementation description of the pose estimation algorithm introduced in [

a) Monocular pose estimation is considered.

b) There is a relative motion between the camera and object.

c) There are at least four markers per object.

d) Camera captures images in the form of video frames, with at least 24 frames per second (fps). The fps is assumed to increase with an increase in the rate of relative motion between the camera and the object.

e) It is a continuous pose

the angular velocity relative to the camera and object,

f) The velocity, v, is measured in spherical coordinates. Spherical coordinates

angle

Since OPEA is an iterative 2D search algorithm, it explores a 2D space for all possible values of the pose of v and ω. The search begins by assuming a value of v, where v in spherical coordinates is given by

and

guous in the image data. It is impossible to find r. Consequently, it is set to be r

= 1. For each assumption of v, a corresponding ω is calculated. The error in

the pose. Since the value determining v have boundary conditions

the pose estimated, i.e., the smaller the incremental value of

Consider two images taken from a single camera with relative velocity between the image and the object. Let the image coordinates of markers on the first image be

where I represents an identity matrix and

Once a value of v is assumed, a corresponding value for ω is calculated as follows:

The difference

Thus, the plot of J versus

If the depth of all points has the positive numerical sign and the value of J corresponding to the same

We begin this section by analyzing the calculations in OPEA. The analysis is split into two phases. The first phase analyzes code refactoring of OPEA to suit parallelization. However, the refactored code can be used as a sequential or parallel code. The second phase analyzes the data reorganization and code for a parallel implementation of the refactored code, to match the GPU architecture.

The first phase of analysis begins by splitting the calculations in OPEA into three steps. Equations (1)-(5) are taken to be the “compute velocity” step. Equations (8), (9) are taken to be “compute error” step. Lastly, Equation (10) is taken to be the “selection of pose” step. If the total number of iterations in each dimension is represented by k,

Hence the execution time of OPEA increases linearly with an increase in m_{f} but a quadratic increase with an increase in k.

a) v dependent computations

b) v independent computations

Since only v dependent computations are needed inside the iteration, the v independent computations can be pre-computed outside the iterations, thus reducing the total FLOP count of OPEA.

We now provide an advanced version of OPEA, referred to as AOPEA, which refactors code to suit parallelization better. To facilitate code refactorization, we introduce two new operations

a) Stackoperation: Convert 3 × 3 symmetric matrix elements to a vector.

b) Unstackoperation: Convert 6 × 1 vector into 3 × 3 symmetric matrix.

Step | FLOP count |
---|---|

Compute Velocity | 138m_{f} + 49 |

Compute Error | 45m_{f} + 11 |

Selection of Pose | 5m_{f} |

Per Iteration | 188m_{f} + 60 |

OPEA Algorithm | (188m_{f} + 60)k^{2} |

The computation of B, C and Q_{s} are performed in two parts. The v independent computations, computed only once for a given set of markers, represented by_{p} are calculated as shown below:

where,

where _{s} are provided below:

where,

The new FLOP count for the AOPEA is provided in

The steps for selection of pose, discussed in Equation (10), can be skipped if J is greater than an already computed minimum value. Hence, neglecting the FLOP count for the same, the total FLOP count for AOPEA is 177m_{f} + 216k^{2} ? 24. The separation of m_{f} and k^{2} computational steps facilitates an increase in execution time exclusive to either the number of markers or accuracy. For example, a real-time application requiring higher accuracy can have lower m_{f} and high k value, whereas a real-time application requiring higher m_{f} can have low k iteration count. In both cases, only the computations relevant to each are increased.

To reduce the overhead of iterations further; the v dependent computations are performed twice-first with coarse accuracy with high increments for each iteration, and second with finer accuracy with smaller increments for each iteration. Coarse accuracy iterations, having smaller values of k, provide a coarse estimation of

The second phase of analysis looks at the implementation of AOPEA as a parallel algorithm. It begins with looking at data reusability for the algorithm.

Step | v Independent FLOP count | v Dependent FLOP count |
---|---|---|

Compute Velocity | 141m_{f} − 12 | 160 |

Compute Error | 36m_{f} − 12 | 56 |

Selection of Pose | 0 | 5 m_{f} |

Per Iteration | 0 | 5 m_{f} + 216 |

AOPEA Algorithm | 177m_{f} ? 24 + (5m_{f} + 216)k^{2} |

For the v independent computations, we arrange the data into two matrices such that one matrix-matrix multiplication provides the results for B_{P}, Q_{sp} and m_{p}. We perform a global synchronization after v independent computations are completed. The results thus obtained are re-arranged into a single matrix, and distributed (data) to many-cores on GPUs, known as streaming processors. This arrangement facilitates coalesced memory access for all matrix or vector multiplications and additions involved. Coalesced memory access on GPUs, are shown to provide better performance in [

In this section, we analyze the effectiveness of AOPEA, its parallelization, and scalability. To analyze AOPEA, first, we compare the execution time of sequential implementation of AOPEA with a reference pose estimation algorithm, and sequential OPEA algorithm. For the reference pose estimation, we use the continuous pose estimation algorithm (CPEA). Second, we look at the execution time of parallel implementations of CPEA, OPEA, and AOPEA parallelized using standard parallel libraries. Lastly, we look at the execution time of our parallel implementation of AOPEA, analyzing its scalability.

The programs are executed on Intel Xeon CPU, having two E5620 processors operating at 2.40 GHz and running a 64 bit Windows 7 Pro Operating System. The sequential programs are executed on a 64-bit MatlabR2014b software. Matlab’soriginal core has been developed from LINPACK and EISPACK [

For parallel program execution, the CPU is equipped with a NVIDIA Tesla C2075 card. The card is equipped with 448 cores and 6GB of memory for general computations. Though K20x and K40m cards seem to be a good option for GPU parallelization, we limit ourselves to a low-cost GPU such as C2075. For the development of parallel AOPEA implementation, NVIDIA’s CUDA 6.0 integrated with Microsoft Visual 2015 via NSight was used. For developing parallel CPEA, OPEA and AOPEA code using standard libraries, cuBLAS library package was used. cuBLAS library package is an accelerated Basic Linear Algebra Subprograms (BLAS) library provided by NVIDIA for GPUs. A combination of APIs available in the cuBLAS package is used to perform the operations in CPEA, OPEA, and AOPEA. Lastly, for profiling NVIDIA’s Visual Profiler, NVVP, com- patible with CUDA 6.0 and Microsoft Visual 2015 was used to profile the code.

For the purpose of this paper, we use execution time as a measure of performance. Low execution time is considered to be better. Each implementation of an algorithm (CPEA, OPEA or AOPEA) for a given marker size and accuracy is considered as one simulation and the execution time is collected. Each simulation is executed one thousand times and an average execution time (AET)is computed. The standard deviation of AET for all simulations was observed to be under 3%. Each simulation has been verified by reconstructing objects in images. Data, for verification of algorithms, is taken from videos under indoor computer lab conditions using 60 fps at 1080 p resolution. The number of markers for simulations is varied to study the scalability of the algorithms, i.e., we choose

_{f}, the CPEA has lower AET, whereas, for larger m_{f}, the CPEA has a non-linear increase in AET. This is due to the computations in CPEA being_{f}, AET for CPEA is low and grows exponentially as m_{f} increases. The computations of OPEA, as seen in _{f} values. For AOPEA, the AET is higher than CPEA for lower values of m_{f} values. This is due to the overhead in v independent computations that are pre-computed before the iterations. However, as v increases, there is a proportionate increase in AET. This is due to the iterative computations of AOPEA being linearly proportional to m_{f}, as seen in

Next, we look at the parallel implementation of CPEA, OPEA and AOPEA, implemented using cuBLAS library API calls. The results, which include the data transfer time between CPU and GPU, are presented in

Markers per Frame | AET (msecs) | ||
---|---|---|---|

CPEA | OPEA | AOPEA | |

32 | 0.32 | 1.52E+05 | 97.70 |

64 | 1.10 | 2.97E+05 | 102.50 |

128 | 1.80 | 5.84E+05 | 118.08 |

256 | 8.50 | 1.18E+06 | 146.14 |

512 | 41.60 | 2.31E+06 | 194.74 |

1024 | 824.50 | 4.66E+06 | 296.45 |

2048 | 70826.70 | 1.02E+07 | 503.71 |

good scaling till 512 m_{f}. This is because CPEA has sufficient parallelism in its code. However, for m_{f} > 512 the profiler indicated occupation of all cores on the GPU. This forces fragments of code in the queue to wait till cores become idle serializing the execution, increasing the AET. In the case of OPEA, the algorithm is embarrassingly parallel with respect to iterations i.e., the number of iterations is more than 90,000 and independent of one another. Hence, the problem has enough workload for GPUs even for low m_{f}. The profilers indicate that the GPU cores are completely occupied. But, as m_{f} increases, due to the lack of idle GPU cores, more code execution is serialized leading to higher AET. In case of AOPEA, with lower computations than CPEA and OPEA and increasing linearly with increase in m_{f}, the AET shows small increments for every doubling of m_{f}. Due to the separation of v dependent and independent computations in AOPEA, the data needs to be re-arranged after completing the v independent computations. This forces additional data transfers between CPU and GPU. For low values of m_{f}, the AET is limited by the overhead of data transfers between CPU and GPU. In fact, the data transfers contribute to nearly 90% of AET for all values of m_{f}. However, despite overhead of data transfer time, AOPEA has the best AET for higher values of m_{f}.

_{f} values. Due

to a single data transfer between CPU and GPU in each direction, and highly optimized code, the AET is found to be just over a millisecond for 2048 markers. The implementation also shows good scalability even at 2048 markers, unlike _{f}. In case of OPEA, we observe a linear speed up with an increase in markers. Whereas for AOPEA, the speed up saturates at about 250×. The low AET and good scalability of our parallel implementation of AOPEA indicate its suit ability for real-time applications.

Though our version of pose estimation shows low AET, using it for real-time applications may have higher AET. This is because real-time applications combine pose estimation with tracking of markers. This would involve additional overheads. For example, real-time applications would use a video, where each frame is considered as an image. The image from the camera needs to be transferred to CPU, and then to the GPU. The image may also need to be pre- processed to obtain distinct position of markers. In order to show that our parallel

implementation can be used for real time applications, we conducted a lab experiment. A 1000 fps high-resolution camera was used to capture a video tracking 32 markers. The position of the 32 markers were pre-calculated, but fed to the pose estimation algorithm in real time on a per image basis. Simulations for different image resolutions were conducted, where each image obtained from the camera, was copied to CPU, transferred to GPU, converted from Red-Green- Blue format to gray-scale format, and then pose was estimated. Using simulations’ AET, the supported fps that our pose estimation algorithm could process, was calculated.

We have modified the algorithm provided in [

requirements of real-time applications. An analysis of the implementation indicated that there was a) redundancy in computations and b) data and code organization un-fitting for GPU architectures. We modified the implementation, to suit parallelization on GPU architectures, in two phases: first, refactoring the algorithm to have lesser number of operations and enhanced parallelism, and secondly, optimizing the data and code to obtain better parallelism for GPU architectures. We compared the effectiveness of our algorithm AOPEA, with CPEA and OPEA, for sequential and parallel implementations. For sequential implementation, AOPEA performed much better than its predecessor algorithm OPEA. However, for lower markers per frame, CPEA performed better than AOPEA whereas for higher markers per frame AOPEA performed better than CPEA. To understand the effectiveness of our parallel implementations, CPEA, OPEA and AOPEA were parallelized using the cuBLAS library and compared with our parallel implementation. The results showed that our parallel implementation of AOPEA has lowest execution time. Moreover, our parallel implementation also showed good scalability of performance with an increasing num- ber of markers per frame. Moreover, a lab simulation of a real-time application indicated that our parallel implementation of AOPEA supports at least 100 fps even for high-resolution videos. Hence, our parallel implementation of AOPEA could be implemented on GPUs for real-time applications using high-resolution frames with a high number of markers per frame.

For our future work, we plan on pursuing two additions to the AOPEA: a) multi-GPU implementation of the algorithm for images from 3 - 4 cameras, and b) integrating our AOPEA and multi-GPU AOPEA algorithm with a tracking algorithm.

Kumar, R.R.P., Muk- nahallipatna, S.S. and McInroy, J.E. (2016) Acceleration of Points to Convex Region Cor- respondence Pose Estimation Algorithm on GPUs for Real-Time Applications. Journal of Computer and Communications, 4, 1-17. http://dx.doi.org/10.4236/jcc.2016.417001