
Data correction methods resistant to pessimistic cases


Duda Jarek


Standard data correction methods have some maximal number of errors they can correct.

For example, Hamming (7,4) uses 3 additional checksum bits for 4 bits of information - it works fine as long as there is at most 1 error per 7-bit block.

We use quite a lot of redundancy, but are we safe now? The problem is the pessimistic cases - if the expected error rate is 1/100 bits, it can still quite often happen that some 7-bit block contains 2 errors.

 

I would like to propose a statistical approach to data correction which allows us to protect against such pessimistic cases.

Thanks to this we can, for example, reduce the redundancy while achieving similar safety.

 

The trick is to use a very precise coding - one in which any error makes the rest of the decoded sequence look like a completely random 0/1 sequence (p(0)=p(1)=1/2).

For example a block cipher that uses the previous block to calculate the following one - but there is a much better coding for this purpose, which I will describe later.

 

Now add some easily recognizable redundancy to the information - for example, insert a '1' between consecutive data bits.

If during decoding a '0' appears in one of these places, it means an error occurred earlier.

Knowing the statistical characteristics of the expected errors, we can make a list of the most probable error patterns for such a case, ordered by probability - at the top of the list should be 'previous bit flipped', ... and after a while 'two bits flipped: ...' will appear. This list can be very large.

Now, if we know that an error occurred somewhere nearby, we go through this list position by position: apply the correction as if that were really the case (flip some bits) and try to decode a fixed number of further bits (a few dozen).

If everything is fine - we get only '1' in the marked positions - we can assume it was this error.

If not - try the next one from the list.

This list can be generated online - given a large amount of time, we could repair even a badly damaged transmission.

While creating the list, we have to remember that errors can also appear in the succeeding bits.
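
To make the search procedure concrete, here is a minimal Python sketch of the idea (my own illustration, not an implementation of any particular coder). It assumes a hypothetical decode_ok(bits, position) that re-decodes a few dozen bits past the suspect position and returns True only if every inserted marker is still '1'.

from itertools import combinations

def try_corrections(bits, error_pos, decode_ok, window=20, max_flips=2):
    # Try flipping 1 bit, then 2 bits, ... inside the window preceding the
    # detected error, positions closest to the error first (most probable).
    candidates = list(range(error_pos - 1, max(-1, error_pos - window - 1), -1))
    for n_flips in range(1, max_flips + 1):
        for positions in combinations(candidates, n_flips):
            trial = list(bits)
            for p in positions:
                trial[p] ^= 1                  # apply this hypothetical error
            if decode_ok(trial, error_pos):    # do the markers read '1' again?
                return trial, positions        # accept this correction
    return None, None                          # extend the list / give up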

 

Using block ciphers is a bit awkward - they are slow, and the blocks in which we have to locate errors are large...

There is a new coding that is just ideal for the above purpose - Asymmetric Numeral Systems (ANS) - a new entropy coder which has very nice properties for cryptography ... and for data correction - it's much faster than block ciphers and uses small blocks of various lengths.

Here, for example, is a demonstration of it:

http://demonstrations.wolfram.com/DataCompressionUsingAsymmetricNumeralSystems/

 

What do you think about it?


I think the state-of-the-art in this regard is erasure coding, using methods like Vandermonde Matrices.

 

This allows for extremely pessimistic cases by incorporating a high degree of redundancy. For example Tahoe, a secure distributed "global" filesystem, uses erasure coding to safely distribute files across a number of peers which are assumed unreliable. Depending on how you configure it, only a fraction of the peers are needed to successfully reconstruct the file (1/4, I believe).

 

The blocks these algorithms operate on are much larger than individual bytes. But still, only having 1/4 of the original data available is a pretty pessimistic case, and erasure coding deals with it just fine (at the cost of quadrupling the size of your stored data).


You mean encoding polynomial coefficients by their values at more points than the degree? I agree that it's a great method, but a pessimistic local arrangement of errors still destroys the whole block. Besides, it's very slow to decode...

 

This standard approach takes blocks of the data and enlarges them to protect against some specific set of errors. We lose the whole block if the errors fall outside this set.

I want to show that this isn't the only way - that instead of using a small block to locate errors inside it, we can potentially use all of the succeeding bits.

So even if the errors form some pessimistic pattern, like clustering around a point, by investing a large amount of time we would still be able to repair it.


You mean encoding polynomial coefficients by their values at more points than the degree? I agree that it's a great method, but a pessimistic local arrangement of errors still destroys the whole block.

 

It'd require errors to be scattered across multiple blocks. This technique has been used quite successfully in all sorts of fields, notably the "forward error correction" employed by cell phones, satellite transmissions, HDTV, Joost, and systems like Tahoe. It's pretty much the de facto error correction scheme employed today for one-way transmissions where schemes like ARQ aren't possible.


It'd require errors to be scattered across multiple blocks...

But it can still happen... and it's slow, and maybe we could use less redundancy to achieve similar safety...

 

We are adding a constant density of redundancy, but errors don't have to come with constant density - it can fluctuate - sometimes it is above average, sometimes below.

Where it is above, the errors can exceed the amount our redundancy can safely cope with.

Where it is below, we've placed more redundancy there than was required - we waste some capacity.

I'm saying that we could transfer these surpluses to help with difficult cases!

 

To do this, we shouldn't separate the information into independent blocks.

It has to be one stream that can tell us that something has just gone wrong - we don't see the pattern (redundancy) we've placed there - so we try to fix the neighborhood of this point until the pattern emerges again as it should.


We are adding a constant density of redundancy, but errors don't have to come with constant density - it can fluctuate - sometimes it is above average, sometimes below.

 

When the density is constant the latency is constant. You'd be hard pressed to find a scheme whereby you could use error correcting bits elsewhere in the stream to combat the extremely pessimistic case of minor corruption of several blocks in a row while still guaranteeing low latency.

 

It all depends on what your use case is: are you trying to combat corruption in transmission, or corruption on disk? Also: is latency a factor? In all the cases I gave above low latency is generally quite important.

 

To do this, we shouldn't separate the information into independent blocks.

 

Well again, it depends on the use case: are you worried about data in transit, or data on disk?

 

In either case, the data is decomposed into blocks by the underlying mechanism. In the case of data transmission, the data is generally packetized, and in the case of data on disk it's broken down into sectors.

 

Corruption of transmitted data will generally occur at the packet level: the entire packet will be dropped. These are the cases where block-based schemes like erasure coding are quite handy.

 

The same thing often occurs with data on disk: the hard drive will be completely unable to read a sector. Instead of getting back a chunk of corrupted data, you get no data because a read error occurred.

 

So the question becomes what use cases do you envision where you're actually getting back corrupted data as opposed to missing a chunk of data?

 

 

It has to be one stream that can tell us that something has just gone wrong - we don't see the pattern (redundancy) we've placed there - so we try to fix the neighborhood of this point until the pattern emerges again as it should.


OK - latency is not a strong point of the scheme I've presented - simple errors can be corrected quickly, but large ones may need a lot of time...

There is also the problem of losing a large block of data... with ANS it's a bit problematic, but we can actually restart decoding after it.

Unfortunately we of course lose its content.

To protect against the scenario of losing whole packets, we can for example place the first, say, 100 bits as the first bit of the first 100 packets, the next 100 bits as their second bit, and so on... now we have to buffer these 100 packets before we can start decoding.
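
A small sketch of that interleaving (my own illustration; the packet count of 100 is just the number used above):

def interleave(bits, n_packets=100):
    # bit i of the stream becomes a bit of packet (i mod n_packets), so losing
    # one whole packet erases only every n_packets-th bit of the stream
    packets = [[] for _ in range(n_packets)]
    for i, b in enumerate(bits):
        packets[i % n_packets].append(b)
    return packets

def deinterleave(packets):
    # inverse operation: requires buffering all n_packets packets first
    n = max(len(p) for p in packets)
    return [p[j] for j in range(n) for p in packets if j < len(p)]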

 

By blocking, I meant placing the information in completely independent blocks (e.g. 7 bits in Hamming) - thanks to this we can easily ensure a short, constant latency, but we cannot 'transfer the surpluses of redundancy' to cope with fluctuations of error density, because each block has independent redundancy.

I agree that because of the variable latency it's rather impractical for telecommunication or memories, but it may be useful for example for archives, which just have to survive a long time...

 

And maybe faster methods allowing such redundancy transfers are possible?

Thanks to this, we could use a smaller amount of redundancy - sized not for the pessimistic error density, but only a bit above the average density - which is usually a few orders of magnitude smaller...


  • 1 month later...

We can use a property of ANS entropy coding to make the above process quicker and distribute the redundancy really uniformly:

to create an easily recognizable pattern, instead of inserting a '1' symbol regularly, we can add a new symbol - a forbidden one.

If it occurs, we know that something went wrong - the nearer to the forbidden symbol, the more probable the error location.

 

Let's say we use symbols with some probability distribution (p_i), so on average we need H = -sum_i p_i lg p_i bits/symbol.

For example, if we just want to encode bytes without compression, we can treat them as 256 symbols with p_i = 1/256 (H = 8 bits/symbol).

 

Our new symbol will have some chosen probability q. The nearer to 1 it is, the more redundancy density we add and the easier it is to correct errors.

We have to rescale the remaining probabilities: p_i -> (1-q) p_i.

In this way, the size of the file increases r = (H - lg(1-q))/H times.
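
As a quick check of this formula (my own helper snippet, not code from the post):

import math

def size_ratio(H, q):
    # r = (H - lg(1-q)) / H : growth factor after adding a forbidden
    # symbol of probability q to a source of entropy H bits/symbol
    return (H - math.log2(1 - q)) / H

print(size_ratio(8, 7/8))   # plain bytes, q = 7/8      -> 1.375
print(size_ratio(4, 7/8))   # the 4+3 case used below   -> 1.75
print(size_ratio(1, 3/4))   # the 1+2 case used below   -> 3.0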

 

Now if while decoding we get the forbidden symbol, we know that:

- with probability q, the first not-yet-corrected error occurred in the bits used to decode the last symbol,

- with probability (1-q)q it occurred in the bits used while decoding the previous symbol,

- with probability (1-q)^2 q ...

 

The probability of the successive cases drops exponentially, especially if (1-q) is near 0.

But the number of required tries also grows exponentially.

But observe that, for example, the number of possible placements of 5 errors among 50 bits is only about 2 million - it can be checked in a moment.
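
These two observations are easy to verify numerically (a tiny check of the numbers above, nothing more; q = 7/8 is just an example value, the same one used in the 4+3 case below):

import math

print(math.comb(50, 5))        # placements of 5 errors among 50 bits: 2,118,760

q = 7/8                        # example forbidden-symbol probability
for k in range(4):             # chance the first error is k symbols back
    print(k, (1 - q) ** k * q) # q, (1-q)q, (1-q)^2 q, ... drops fast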

 

Let's compare it to two well-known data correction methods: Hamming 4+3 (to store 4 bits we use 3 additional bits) and tripling each bit (1+2).

Take the simplest error distribution model - for each bit the probability that it is flipped is constant, let's say e = 1/100.

So the probability that a 7-bit block contains at least 2 errors is

1 - (1-e)^7 - 7e(1-e)^6 =~ 0.2%

For a 3-bit block it's about 0.03%.

So for each kilobyte of data we irreversibly lose about 4*4 = 16 bits with Hamming 4+3, and about 2.4 bits with bit tripling.

We see that even for methods that look well protected, we lose a lot of data because of pessimistic cases.
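
The numbers above can be reproduced with a few lines (assuming, as in the post, independent bit flips with e = 0.01 and one kilobyte = 8192 information bits):

e = 0.01

def p_uncorrectable(n):
    # probability of 2 or more flips in an n-bit block (more than these codes fix)
    return 1 - (1 - e) ** n - n * e * (1 - e) ** (n - 1)

info_bits = 1024 * 8                           # one kilobyte of information
print(info_bits / 4 * p_uncorrectable(7) * 4)  # ~16 bits lost: Hamming 4+3
print(info_bits * p_uncorrectable(3) * 1)      # ~2.4 bits lost: bit tripling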

 

For ANS-based data correction:

4+3 case (r = 7/4) - we add the forbidden symbol with probability q = 1 - 1/2^3 = 7/8, and each of the 2^4 = 16 symbols has probability 1/16 * 1/8 = 1/128.

In practice ANS works best if the lg(p_i) aren't natural numbers, so q should preferably (though not necessarily) be not exactly 7/8 but something around it.

Now if the forbidden symbol occurs, with probability about 7/8 we only have to try flipping one of the (about) 7 bits used to decode this symbol.

With 8 times smaller probability we have to flip one of the 7 bits of the previous symbol... with much smaller probability, depending on the error density model, we should try flipping some two bits ... and even extremely pessimistic cases look like they can be corrected in reasonable time.

For the 1+2 case (r = 3), the forbidden symbol has probability about 3/4, and '0' and '1' have 1/8 each.

With probability 3/4 we only have to correct one of 3 bits ... with probability 255/256, one of 12 bits ...

 

-----

 

There is one problem - in practice the coding/decoding tables should fit into cache, so we can use at most about a million states.

While trying thousands of combinations during correction, we could accidentally reach a correct-looking state with a wrong correction - a few bits would be corrected the wrong way and we wouldn't even notice.

To prevent this we can, for example, use two similar stages of ANS - the first produces bytes and the second converts the output of the first into the final sequence.

The second stage would receive uniformly distributed bytes, but ANS itself creates some small perturbations, so it will work fine.

Thanks to this the number of states grows to the square of the initial one, reducing this probability by a few orders of magnitude at the cost of doubled time requirements.

We could also use a checksum for a final confirmation.


I've just realized that we can use the huge freedom of choice in the ANS coding functions to improve latency - we can arrange that if the forbidden symbol occurs and there was only a single error, we are sure it was among the bits used to decode this symbol.

We may still have to go back to the previous ones, but only if there were at least 2 errors among these bits - which is an order of magnitude less probable than before.

The other advantage is that if we try to verify a wrong correction by decoding further, a single error in a block will automatically tell us that the correction is wrong. There could be 2 errors, but they are much less probable, so we can check for them much later.

 

The trick is that the forbidden symbol usually dominates the coding tables, so we can arrange that whenever a given sequence of transferred bits decodes to an allowed symbol, every sequence differing in one bit (Hamming distance 1) decodes to the forbidden symbol.

 

So for the initialization, we choose some number of appearances for the allowed symbols and have to place them somehow.

For example: take an unplaced symbol, place it at a random unused position (using a list of unused positions), and place the forbidden symbol at every state that differs in one of 'some' number of the last bits.

This 'some' is a bit tricky - it has to work under the assumption that only allowed symbols were decoded previously, but it could have been any of them.

If we are not compressing - all symbols equally probable - this 'some' is -lg(p_i) plus or minus 1: plus for high states, minus for low ones.

 

Some states should remain unused after this procedure. We can fill them with forbidden symbols, or continue the procedure above, inserting more allowed symbols.
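
Here is a rough sketch of the flavour of this placement (my own simplification - it only chooses bit patterns for the allowed symbols so that everything within a chosen Hamming distance of them stays forbidden; a real ANS table additionally distributes symbol appearances over the states):

import random

def place_allowed(n_bits, n_allowed, min_dist=2, seed=0):
    rng = random.Random(seed)           # the seed could double as an encryption key
    free = set(range(2 ** n_bits))      # patterns not yet claimed by anything
    placed = []
    while free and len(placed) < n_allowed:
        s = rng.choice(sorted(free))    # random unused position for an allowed symbol
        placed.append(s)
        # reserve everything within Hamming distance (min_dist - 1) as forbidden
        shell = {s}
        for _ in range(min_dist - 1):
            shell = {p ^ (1 << b) for p in shell for b in range(n_bits)} | shell
        free -= shell
    return placed                        # all remaining patterns stay forbidden

With min_dist=2, flipping any single transferred bit of an allowed pattern always lands on the forbidden symbol, which is the property used above.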

This random initialization still leaves a huge freedom of choice - we can use it to additionally encrypt the data, using a random generator initialized with a key.

If we want data correction only, we can exploit the fact that in this procedure many forbidden symbols get marked a few times - the more of them, the smaller the output file ... with a bit smaller but comparable safety.

So we could consciously choose some good schemes, maybe even ones using Hamming distance 2 (or greater) - so that at least 3 errors would have to occur to send us back to the previous symbol.

 

For example the 4+3 scheme seems perfect: we transfer 7 bits on average, and for every allowed symbol there are 7 forbidden ones.

For some high states like 111******** (the stars are the transferred bits) we have to place 8 forbidden symbols, but for low ones like 10000****** we can place only six.

Some of the forbidden states will be marked a few times, so we should run the whole procedure and possibly use a somewhat smaller (or larger) number of allowed symbols.


I've just realized that Hamming codes and bit tripling are special (degenerate) cases of ANS-based data correction :)

In the previous post I gave arguments that it would be beneficial if any two allowed states had Hamming distance at least 2.

If we made this distance at least 3, we could unambiguously and instantly correct a single error, as in Hamming.

 

To get bit tripling from ANS we use:

states from 1000 to 1111

Symbol '0' is in state 1000, symbol '1' in state 1111 (Hamming distance 3), and the remaining six states hold the forbidden symbol.

We have only 1 appearance of each allowed symbol, so after decoding it, before the bit transfer, the state always drops to '1' and the three least significant bits are transferred from the input.
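
Written out explicitly, that table is tiny (a toy illustration of the decoding side, using the state convention above: '1' followed by the three received bits):

TABLE = {0b1000: '0', 0b1111: '1'}        # the remaining six states are forbidden

def decode_triplet(b0, b1, b2):
    state = 0b1000 | (b0 << 2) | (b1 << 1) | b2
    return TABLE.get(state, 'forbidden')  # 'forbidden' = a detected error

print(decode_triplet(0, 0, 0))  # '0'
print(decode_triplet(1, 1, 1))  # '1'
print(decode_triplet(1, 0, 1))  # 'forbidden' -> some bit was flipped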

 

To get Hamming 4+3,

states are from 10000000 to 11111111

We have 16 allowed symbols, from '0000' to '1111', each with exactly one appearance - the state 1*******, where the stars are the 7 bits it would be coded into in Hamming - and any two different ones have Hamming distance at least 3.

After coding, the state drops to '1' again and this '1' will be the oldest bit after the bit transfer.

 

The fact that each allowed symbol has only one appearance means that after decoding, the state each time drops to '1' - it's a kind of degenerate case - all blocks are independent and we don't transfer any redundancy between them.

It can handle a large error density, like 1/7 (for Hamming 4+3) ... but only as long as each block contains at most 1 error.

In practice errors don't come so regularly, and even with a much smaller error density Hamming loses a lot of data (like 16 bits per kilobyte for 0.01 error probability).

 

Let's think about the theoretical limit on the number of redundancy bits we have to add per bit of information, for an assumed statistical error distribution, to be able to fully correct the file.

To find this threshold, consider a simpler-looking question: how much information is stored in such an uncertain bit?

Take the simplest error distribution model - for each bit the probability that it is flipped equals e (near zero) - so if we see '1' we know that with probability 1-e it's really '1', and with probability e it's '0'.

So if we additionally knew which of these cases holds - information worth

h(e) = -e lg(e) - (1-e) lg(1-e),

we would have a whole bit.

So such an uncertain bit is worth 1 - h(e) bits.

So to transfer n real bits, we have to use at least n/(1-h(e)) of these uncertain bits - the theoretical limit for being able to read the message is (asymptotically)

h(e)/(1-h(e)) additional bits of redundancy per bit of information.
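
Evaluating this bound for e = 1/100 (a direct restatement of the formulas above):

import math

def h(e):
    return -e * math.log2(e) - (1 - e) * math.log2(1 - e)

e = 0.01
print(h(e))               # ~0.0808 bits of uncertainty per received bit
print(h(e) / (1 - h(e)))  # ~0.088 redundancy bits needed per information bit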

 

So a perfect data correction coder for e = 1/100 error probability would need only an additional 0.088 bits/bit to be able to restore the message.

Hamming 4+3, despite using an additional 0.75 bits/bit, loses 16 bits/kilobyte with the same error distribution.

 

Hamming assumes that every 7-bit block can arrive in 8 ways - correct, or with one of its 7 bits changed.

It uses the same amount of information to encode each of them, so it has to add at least lg(8) = 3 bits of redundancy per block - we see it's done optimally...

... but only if all 8 of these ways were equally probable under the error distribution...

In practice, the most probable case is no error, then one error ... and, with much smaller probabilities, more errors ... depending on what the error distribution of our medium looks like.

 

To move toward the perfect error correction coder, we have to break with the uniform distribution over cases used in Hamming and instead match the real error distribution probabilities.

 

If the intermediate state of ANS-based data correction can take many values, we transfer some redundancy - the 'blocks' become connected, so if more errors occur in one of them, we can use this connection to see that something is wrong and use unused redundancy from the succeeding blocks to correct it - relying on the assumption that, according to the error distribution, the succeeding blocks are correct with large probability.

We have huge freedom in choosing the ANS parameters to get closer to the assumed probability model of the error distribution ... and so to the perfect data correction coder.


  • 4 months later...

I've just finished a large paper about ANS. It adds some deeper analysis and gathers rethought information I've posted on different forums...

 

It is also shown that the presented data correction approach can really reach the theoretical Shannon limit and appears to have expected linear time, so it should be much better than the only practical method of this kind in use - Low Density Parity Check (LDPC):

http://arxiv.org/abs/0902.0271


  • 4 months later...

A simulator of the correction process has just been published on the Wolfram Demonstrations page:

http://demonstrations.wolfram.com/CorrectionTrees/

It shows that we finally have a near-Shannon-limit method working in nearly linear time for any noise level.

 

For a given probability of bit damage (p_b), we choose the parameter p_d. The higher this parameter is, the more redundancy we add and the easier it is to correct errors.

We want to find the proper correction (the red path in the simulator). The main correction mechanism is that expanding the proper correction is always fine, while in each step of expanding a wrong correction we have probability p_d of realizing it. With p_d large enough, the number of corrections we have to check no longer grows exponentially.

At each step the tree structure built so far is known, and we use it to choose the most probable leaf to expand.
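
A minimal sketch of that 'expand the most probable leaf' loop (my own illustration of the search pattern, not the simulator's code; it assumes hypothetical expand() and is_complete() callbacks and nodes carrying a log-probability logp):

import heapq

def best_first_correction(root, expand, is_complete, max_steps=10**6):
    heap = [(-root.logp, 0, root)]        # most probable leaf pops out first
    counter = 1                           # tie-breaker so nodes never get compared
    for _ in range(max_steps):
        if not heap:
            return None                   # no candidate corrections left
        _, _, node = heapq.heappop(heap)
        if is_complete(node):
            return node                   # decoded to the end of the message
        for child in expand(node):        # each child = one correction hypothesis
            heapq.heappush(heap, (-child.logp, counter, child))
            counter += 1
    return None                           # out of budget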

 

I've realized that practical correction methods (ones not requiring exponential correction time) need a bit more redundancy than the theoretical (Shannon) limit. Redundancy allows us to reduce the number of corrections to consider. In practical correction methods we rather have to elongate corrections, so we have to assume that the expected number of corrections up to a given moment is finite, which requires more redundancy than Shannon's limit (observe that block codes fulfill this assumption).

This limit is calculated in the last version of the paper (0902.0271). The basic correction algorithm (as in the simulator) works for a bit worse limit (needs larger encoded file by at most 13%), but it can probably be improved.

Finally, this new family of random trees has two phase transitions: for small p_d < p_d^0 the tree immediately expands exponentially. For p_d^0 < p_d < p_d^2 the tree generally has small width, but rare high error concentrations make its expected width infinite (like a long tail in a probability distribution). For p_d > p_d^2 it has finite expected width.

 

Error correction methods used today work practically only for very low noise (p_b < 0.01). The presented approach works well for any noise level (p_b < 0.5).

For small noise it needs an encoded file size practically at Shannon's limit. The difference appears for large noise: it needs a file at most twice larger than the limit.

A practical method for large noise gives a new way to increase the capacity of transmission lines and storage devices - for example, place two bits where we would normally place one - the cost is a large increase in noise, but we can handle it now.

 

For extremely large noise, we can no longer use ANS. Fig. 3 of the paper shows how to handle it. For example, if we have to increase the size of the file 100 times, we can encode each bit in 100 bits - encode 1 as (11...111 XOR 'hash value of the already encoded message'), and the same for 0. Now, while creating the tree, each split will have a different number of corrected bits - a different weight.


  • 1 year later...

I defended my PhD in computer science (another is now in progress in physics), half of which was about this new approach to data correction - basically it extends the convolutional codes concept to much larger states (which requires working not on the whole space of states as usual, but only on the used tree of states, and allows practically complete repair in linear time) and uses an entropy coder to add redundancy (simultaneous data compression, and the rate can be changed fluently).

The thesis is the arXiv paper with a few small improvements (it can be downloaded from http://tcs.uj.edu.pl/graduates.php?degree=1&lang=0 ).

Here is the presentation I've used - with a few new pictures which should make the concepts (and asymmetric numeral systems) easier to understand, and e.g. some comparison to LDPC.


  • 1 year later...

I apologize for digging this thread up, but there is finally a practical implementation, and it beats modern state-of-the-art methods in many applications. It can be seen as a greatly improved convolutional-code-like concept - for example, it no longer uses convolution, but a carefully designed, extremely fast operation that allows working on much larger states instead. Other main improvements are bidirectional decoding and using a heap (logarithmic complexity) instead of the stubbornly used stack (linear complexity). For simplicity it will be called Correction Trees (CT).

The most important improvement is that it can handle larger channel damage at the same rate. Adding redundancy to double (rate 1/2) or triple (rate 1/3) the message size should theoretically allow complete repair of up to, correspondingly, 11% or 17.4% damaged bits for the Binary Symmetric Channel (each bit is independently flipped with this probability). Unfortunately, this Shannon limit is rather unreachable - in practice we can only reduce the Bit Error Rate (BER) if the noise is significantly lower than this limit. Turbo Codes (TC) and Low Density Parity Check (LDPC) are nowadays seen as the best methods - here is a comparison of some of their implementations with the CT approach - output BER versus input noise level:

 

[Figure: comparison of output BER vs. input noise level for Turbo Codes, LDPC, and Correction Trees]

 

We can see that CT still repairs where the others have given up - making it perfect for extreme applications like deep space or underwater communication. Unfortunately, repairing such extreme noise requires extreme resources - a software implementation on a modern PC decodes a few hundred bytes per second at extreme noise levels. Additionally, using more resources, the correction capability can be further improved (lowering the line in the figure above).

On the other hand, CT encoding is extremely fast, and correction for low noise is practically free - up to about 5-6% for rate 1/2. In comparison, TC correction always requires a lot of computation, while LDPC additionally requires a lot of work just for encoding.

So, in contrast to them, CT is just perfect for e.g. hard disks - everyday operation deals with low noise, so using CT would make it extremely cheap. On the other hand, if the disk is really badly damaged, correction is still possible but becomes costly. Such correction could also be done externally, allowing further improvement of the correction capability.
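
The 11% and 17.4% limits quoted above are just the BSC capacity condition 1 - h(p) = rate; a quick numerical check (my own snippet):

import math

def h(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_limit(rate):
    lo, hi = 1e-12, 0.5 - 1e-12           # h is increasing on (0, 0.5)
    for _ in range(200):                  # bisection on 1 - h(p) = rate
        mid = (lo + hi) / 2
        if 1 - h(mid) > rate:
            lo = mid
        else:
            hi = mid
    return lo

print(bsc_limit(1/2))   # ~0.110 -> 11% flipped bits at rate 1/2
print(bsc_limit(1/3))   # ~0.174 -> 17.4% at rate 1/3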

 

Paper: http://arxiv.org/abs/1204.5317

Implementation: https://indect-project.eu/correction-trees/

Simulator: http://demonstrations.wolfram.com/CorrectionTrees/

