Jump to content


Photo
- - - - -

compression on cuda


  • Please log in to reply
11 replies to this topic

#1 karthika

karthika

    Lepton

  • Members
  • 5 posts

Posted 13 September 2012 - 06:42 PM

is compression algorithms on CUDA a good area to doing project as part of mtech course?
please suggest some papers related to this topic.
  • 0

#2 Enthalpy

Enthalpy

    Primate

  • Senior Members
  • 2,246 posts

Posted 16 September 2012 - 01:44 PM

Why not.

But if you want your project to be useful, it should fit normal users, not exotic research:
  • Work on BOTH nVidia AND Amd processors. I wouldn't try a soft that I lose if I change my card.
  • Works on normal graphics cards, not on specialized parallel processors that nobody has.
  • Use standard compression formats like zip, rar, gzip, 7zip...
  • As you won't write a full user interface (that's a different big project) your software should interface with an existing archiver like 7zip, IZArc... 7zip is an open source project, so people there publish their interfaces and may welcome you.
  • Be faster than a four-core Cpu, on which 7zip uses the SSE instructions and hyperthreading on all cores...
Under these conditions, of which I don't see any single one as optional, your compressor would be useful.

Other pieces of software that would benefit from Gpgpu:
  • An antivirus file parsing algorithm. At least one antivirus editor (Kaspersky?) has begun doing it.
  • Cryptography. Less new.
The two latter are harder to interface with an existing software as these tend to be commercial and non-public, so such a project would be rather academic with little usefulness.

One other use, very specialized but badly needed, is a step in the design of lithography masks for integrated circuits production. I believe it's the splitting of patterns into rectangles - sorry it's old story for me. It was a two-weeks task on the biggest computers 20 years ago, and still is today, because computers evolve as slowly as chips do. Costly, and wastes two weeks (often several times) in the time to market, so if you find a good implementation on a many-Gpu machine, the few users will be very interested and angry to interface your implementation in their programme.

Edited by Enthalpy, 16 September 2012 - 01:51 PM.

  • 0

#3 karthika

karthika

    Lepton

  • Members
  • 5 posts

Posted 17 September 2012 - 03:36 PM

I am satisfied with your reply...can you suggest more..
  • 0

#4 Enthalpy

Enthalpy

    Primate

  • Senior Members
  • 2,246 posts

Posted 18 September 2012 - 10:59 PM

Not so easy, because programmes benefit more from subtle algorithms than from computing speed, and subtle algorithms tend to run horribly on a Gpu...

But on such a parallel machine (a Gpu) you could try to implement
  • An inference engine
  • A Prolog interpret (if this is still any fashionable)
as both exhibit a regular uniform parallelism accessible to the machine. It could be less banal than image processing or finite elements.
Though, I don't know Cuda enough to tell if it's the right layer for such a purpose. Maybe you need first your own layer, different from Cuda.

==================================

You might have a look at error-correcting codes (ECC). A Gpu helps a lot decoding them, but... Codes are full of maths. Not complicated for a mathematician :rolleyes:: Galois fields and statistics. Telecom engineers achieve to understand them (...more or less :P); with a software background, you would need to have studied this particular math or have a real passion and ease for it.
http://en.wikipedia....correcting_code

You should first check what is already done on Gpu, because ECC specialists program Digital Signal Processors and are easy with software.

Some decoding tasks that might benefit from a Gpu:
  • Decoding bigger Reed-Solomon codes. This means just solving a big set of linear equations, where linear means in a Galois field instead of float numbers. Please check if the Gpu can efficiently access random table elements, since these computations rely on (Galois field) logarithm tables.
  • Decoding of concatenated codes. As they run many codes in parallel, they may fit a Gpu - but often rely on Galois field computations.
  • Decoding of random codes. These codes are optimized for correction efficiency but have no fast formal decoding algorithm, so a Gpu may enable them.
  • Soft-decoding of block codes. Soft means here that the received bits aren't hard-decided to 1 or 0 prior to error correction, but given a continuous value or a probability - it improves the correction performance. Fast and imperfect algorithms exist and work for "convolutional codes", but algorithms correcting block codes properly would perform a lot better. Here near-brute force algorithms may be best (please let check), comparing the probability of all allowed and credible combinations of error positions; they'd need massive computation on float numbers, which a Gpu would enable. Easier maths!
  • Soft-decoding of random codes. Of course. Even more so because they are high-performance block codes and no good decoding method exists. As soft-decoding looks no more difficult for them than hard-decoding, grasp the opportunity. Easier maths!
  • Soft-decoding of turbo codes. I suppose it exists already. The common description of turbo codes claims bits are 0 or 1 (hard decision) and are flipped when decoding along alternate directions; instead, you could attribute each bit a probability and alter it softly according to the code in each direction, which would bring a brutal improvement to the correction capability. Since turbo codes let many codes run in parallel, they could fit a Gpu.
  • Soft-decoding of turbo codes that combine random codes. Of course.
I suppose these nearly-brute force methods will stay rather slow even on a Gpu, so the typical use wouldn't be fiber optics transmissions, but rather deep space crafts. Improving data comms for mythical probes like Voyager or Pioneer would of course be fascinating, knowing that the Square Kilometer Array is under construction... Or for planned probes, sure.
http://en.wikipedia....wiki/Pioneer_10 and http://en.wikipedia....wiki/Pioneer_11
http://en.wikipedia..../Voyager_1 and http://en.wikipedia.org/wiki/Voyager_2
http://en.wikipedia....ce_Network and http://en.wikipedia....Kilometer_Array

Marc Schaefer, aka Enthalpy

==================================

Some big science projects may benefit from Gpu and still not use them... Perhaps!
  • The LHC lets a huge Internet grid of remote computers analyze and store the data collected by the detectors. I suspect it runs on the Cpu of these computers. Use their Gpu, if it's fast, and if the task fits it?
  • LHC's programmed successors?
  • The Square Kilometer Array needs processing power and is to rely on a grid of remote computers.
  • Neutrino detectors, dark matter detectors, gravity waves detectors...
  • DNA sequencing
Smaller science or engineering:
  • Compute the shape of a molecule, both from force fields or from first principles
  • Compute the arrangement of (stiff...) molecules in a solid
  • Compute band diagrams in crystals
In all these applications, first check what is already done on a Gpu (protein folding, Seti and many more are) and what the people involved desire. Also, as nearly all run on Linux, Cuda may not be the proper layer... A Linux equivalent, or maybe OpenGL.
  • 0

#5 Enthalpy

Enthalpy

    Primate

  • Senior Members
  • 2,246 posts

Posted 19 September 2012 - 02:08 PM

For the inference engine and the Prolog interpret (or interpreter), you could try to take the opinion of Yann Lecun. He speaks English and worked at New York university last time I heard from him. He was previously involved in artificial intelligence and special hardware.

He might well suggest to program a neural network on the Gpu (he worked on machine learning by gradient back-propagation). I just fear this is banal.

-----

For the error-correction codes, if you can associate with a telecom engineer, then programming the soft-decoding of block codes is accessible to a software engineer.
  • 0

#6 jimmyhelu

jimmyhelu

    Quark

  • Members
  • 12 posts

Posted 23 September 2012 - 01:40 PM

Typically compression algorithms cannot make use of parallel tasks, it is not easy to make the algorithms highly parallelizeable. In your examples, TAR is not a compression algorithm, and the only algorithm that might be highly parallelizeable is BZIP because it is a block compression algorithm. Each block can be compressed separately, but this would require lots and lots of memory. LZMA does not work in parallel either, when you see 7zip using multiple threads this is because 7zip splits the data stream into 2 different streams that each are compressed with LZMA in a separate thread, so the compression algorithm itself is not paralllel. This splitting only works when the data permits this.

<br class="Apple-interchange-newline">
  • 0

#7 Enthalpy

Enthalpy

    Primate

  • Senior Members
  • 2,246 posts

Posted 23 September 2012 - 03:04 PM

A much simpler project now. I gratefully use a file navigator called FreeCommander which can compare two folders in order to synchronize their contents (and does much more). It tells: these files are identical, others differ, these are missing - and then: which ones do you want to update or copy.

I misuse it frequently on huge folders, like collections of graphics drivers, and then it gets slow and the CPU is the bottleneck, not the SSD. Easy to improve with a GPU.

Even better: I often rename folders or reorganize them, which makes the answer by FreeCommander little useable. I wish it answers "this file was just renamed" or "this subfolder was moved there". It would need to compute hashes of the files to later compare many ones more quickly, but FreeCommander is slower at computing hashes. If you define your hash algorithm properly, it can be parallel even on one file, and there are several files anyway, so a GPU can improve that.

Note: comparing just the very beginning of the files even in a first step won't suffice, since many ones begin with "I'm a Jpeg picture".
Note: the hash function must give a unique value but needs no cryptographic strength.
Note: on any nVidia or Amd graphics card, not an exotic board - but you might restrict to dX10 hardware and above.
Note: you can compute independent hash value for each 512B or 4kiB block and leave the CPU compare the hash. Flexibler for the CPU which can cut unnecessary work, more parallel for the GPU.

Many similar navigators and synchronizers exist, FreeCommander is just the one I use. Its author is a nice guy and speaks many languages, so if you develop the file comparison or hash on GPU and take contact with him he might integrate it in FreeCommander.

Marc Schaefer, aka Enthalpy

Edited by Enthalpy, 23 September 2012 - 03:11 PM.

  • 0

#8 Enthalpy

Enthalpy

    Primate

  • Senior Members
  • 2,246 posts

Posted 23 September 2012 - 09:41 PM

Alas! A hash value for such purpose is too easy to compute, and already the CPU is faster than any disk. That way :

for (a = successive 16-bit slices in a cluster) {
f = float32 (a);
s1 = f + 0.9990*s1;
s2 = f + 0.9991*s2;
s3 = f + 0.9992*s3;
s4 = f + 0.9994*s4;
}

The four float32 s1, s2, s3, s4 make the hash value of one 4096B cluster. Good enough to compare files.

A GPU with 128 MAC at 1500MHz would have hashed files at 96GB/s but the video Ram limits to some 50GB/s and a Pci-E 2.0 *16 to 8GB/s...
One SSE crunches a slice per clock, so a 3GHz quad-core hashes 24GB/s but the Ram limits...
And a fast SSD limits to 0.5GB/s, blistering barnacles!

So while I'm still interested by the added functions on file explorers like FreeCommander, the task is too easy for a GPU. Obviously, any O(n) computation will stumble at data throughput.

Back to the error-correcting codes then...
Marc Schaefer, aka Enthalpy

Edited by Enthalpy, 23 September 2012 - 09:46 PM.

  • 0

#9 Enthalpy

Enthalpy

    Primate

  • Senior Members
  • 2,246 posts

Posted 26 September 2012 - 11:46 AM

I suggested four 32-bits floats to compute hash values quickly on all GPU, but if using the SSE, MMX or a wide FPU, two 64-bits floats are as good or better. Convert 32-bits slices of the files to be hashed then.
  • 0

#10 karthika

karthika

    Lepton

  • Members
  • 5 posts

Posted 1 February 2013 - 08:03 AM

comparison of two algorithms using CUDA is better approach for research
project.then how can i implement this?should i implement all algorithms?



can i proceed with BZIP on CUDA?then please give the details how to do this as a thesis project.buying GPU is  feasible

please give the procedure and approximate cost of the processor.



LZW compression algorithm on CUDA already implemented?is it feasible?any improovent than with cpu ?


  • 0

#11 Enthalpy

Enthalpy

    Primate

  • Senior Members
  • 2,246 posts

Posted 7 March 2015 - 10:54 PM

I suggested on 18 September 2012 to run a so-called "soft-decision" on block codes like Golay and this spreadsheet finds a seducing sensitivity
Attached File  SoftBlockReedSolomon_v1.zip   6.83KB   10 downloads

A Reed-Solomon code that corrects up to 209 errors in 4095 12-bit words combines with soft-decision of a block code, here an extended Golay with its 12 bits information. The Reed-Solomon would lose just 0.1dB if it can correct 102 errors, 0.2dB for 72 errors. I compare the receiver's sensitivity with a Bpsk transmission of same error rate without correction and same information rate (since the redundancy increases the receiver's bandwidth hence noise, many books err at this point), and

  • The soft Golay fills 45% of the throughput with information but gains 11.1 dB, wow.
  • The Reed-Solomon alone would gain 8.6dB, a soft-decoded Hamming(18,12,4) 9.3dB.
  • The extreme case Hadamard(4096,12,2048) would gain 12.8dB but fill very little.

On a block code, soft-decision simply permits to integrate over a longer time (the code distance) the differences between the codewords and obtain a complete symbol. This makes the gain, despite the result must be distinguished from many neighbour codewords.

I didn't compare the figures with a turbo-decoding scheme. A Golay here isn't so big but it combines nicely with the Reed-Solomon, and its soft-decision is optimum.

Maybe I find again my old software to produce so-called random codes. Since no decoding method is needed, they fit well here. A code longer than Golay may be useful, and some with more than 12 bits information too.

----------

Correlating the 24 received sample with all 4096 codewords of a Golay to decide the best symbol takes 98e3 Mul-acc if squandering a processor's multiplier for the +1 and-1; just use conditional add if it's faster or saves power.

Alternately, we could precompute all sums and differences for groups of four samples for instance and only combine these precomputed values, or make even more subtle tricks, but this fits an Avx or Gpu less well and may be slower. A clearer option is a special chip that adds many small integers conditionally; with unoptimized 4096 adders on 20 bits it soft-decides a symbol in 24 cycles.

  • A 2.5GHz dual-core Sse achieves 20e9 Mul-Acc/s on 32b floats to make the soft-decision for 200e3 symbols/s or 270kB/s information.
  • Two 3.5GHz six-core Avx (336e9 Mul-Acc/s) soft-decide 4.6MB/s.
  • One video card (Gtx 750 ti: 653e9 Mul-Acc/s) soft-decides 9MB/s.
  • Two video cards (Gtx Titan Z: 8122e9 Mul-Acc/s) soft-decide 123MB/s.

So while the Gpu isn't always necessary, it enables a higher data rate or a cheaper computer. Bigger codes, which gain sensitivity, need much more compute power.

Marc Schaefer, aka Enthalpy

 


  • 0

#12 Enthalpy

Enthalpy

    Primate

  • Senior Members
  • 2,246 posts

Posted 8 March 2015 - 12:56 PM

About random codes: most authors understand it as codes chosen randomly. This was not my intention. Here I meant: codes strongly optimized, that follow no theory (no BCH for instance) hence have no efficient decoding algorithm.

 

For instance, a BCH(15,7,5) is less than optimum (never believe a textbook) as its 2*4 bits redundancy suffice to distinguish two errors at positions (i,j) from (j,i) which is useless - but it has a decode algorithm. As opposed, my old software (found it again, but too long to adapt to other sizes) finds codes(16,8,5) which have a more convenient size, are more performant - but have no general decode method. While decoding tables have 256 bytes for the codes(16,8,5), they're impractical for bigger codes.

 

By the way, my (16,8,5) aren't bad when soft-decided and combined with a 255 bytes long Reed-Solomon; easy job for a computer. If the Reed-Solomon can correct 26 errors, the combination gains 9.8dB over a codeless Bpsk, while an 18 errors capability is only 0.1dB worse and 15 errors 0.2dB.

 

With this "short" Reed-Solomon, an additional external code seems to bring something. It wasn't the case with 4095 symbols.

 

This is less than Golay's 11.1dB, partly because the code is smaller, and also because the (16,8,5) isn't perfect: it approaches (16,8,6),  soft-decoding takes advantage of it but this is neglected here. Bigger theoryless codes are thus interesting.


  • 0




0 user(s) are reading this topic

0 members, 0 guests, 0 anonymous users