# UNICODE AND SEQUENCES


Hi,

I'm a PhD student working on sequence alignment, and I have a dataset.

I need a way to encode my data before aligning it.

The values in my data range from 1 to 11000,

i.e. if I need a unique code for every value in my data, I need a large set of characters.

Can I use Unicode to encode my data?

If so, please give me a link and a way of representing my data.

Thanks

##### Share on other sites

Why does it need to be a character?

If you're doing it in a computer, surely storing it as an integer would be faster/more efficient.

If you need a visual representation for some reason (maybe aligning things by eye somehow?) then wouldn't representing them as colours or simple patterns of colours be more efficient?

Perhaps you could explain a bit more about the nature of this data and what you are doing?

##### Share on other sites

I think that to apply sequence alignment operations, I must encode my data.

See the paper in "Sociological Methods & Research", section 3:

the data must first be encoded into a set of sequences using a finite alphabet of states.

My data consists of integers over a big range, so I need many characters to represent each value uniquely,

i.e. my data is a sequence of integers.

If you have any other information, please tell me.

##### Share on other sites

These questions are a little unusual for somebody who is completing a graduate degree. Your data can simply be the set of the first 11000 integers, where each integer from 1 to 11000 would be in essence a character representation or encoding of your data. The fact that it is a multi-character representation or encoding of your data is somewhat irrelevant.

I strongly suggest that you do not attempt to use Unicode. If it is absolutely imperative that the data be visually compressed to some degree I would suggest using maybe base 128. I mean really, could you imagine having to memorize the significance of a set consisting of 11000 characters? It would simply be impossible. A purpose for doing so would be difficult to establish as well. The computing device will still be working with sets of integers whose values each have been assigned a character symbol representation.

What is it that you think you are accomplishing by doing so?

As always I may be blind to some important piece of information. I hadn't posted because of this, but your response to Schrödinger's hat suggests that my assumptions are correct and that you are drawing unnecessary conclusions from the information that was made available to you.

Edited by Xittenn
##### Share on other sites

Anyway, in all cases I appreciate your suggestions.

Look, my queries may not seem logical for a PhD student, but I see the coding as the most important stage in my work,

because the design of the sequence alignment algorithm depends on the method of coding.

You need more information about my work.

Briefly:

I am trying to find the similarity among a group of users in terms of their actions over time in online forums.

These actions are represented by integers, so each user has a sequence of integers.

Now I need a way to represent these integers in order to apply a sequence alignment algorithm.

You mentioned that it is difficult to use Unicode.

But this paper used Unicode, because the authors have a large set of inputs, as in my situation. Their inputs are webpages.

I hope to hear your suggestions.

This is the paper:

"GROUPING WEB ACCESS SEQUENCES USING SEQUENCE ALIGNMENT METHOD"

HUPENDRA S CHORDIA PG Student, Department of Computer Engineering, SSBT, COET, Bambhori

Jalgaon, Maharashtra 4250001, India chordiabs@yahoo.com

##### Share on other sites

you know that you can represent your codes in a higher base:

11000 (base 10 ~5 symbols) = 2AF8 (base 16 ~4 symbols) = ANO (base 32 ~3 symbols)
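For anyone who wants to check those figures, here is a minimal sketch of the conversion. The helper name `to_base` and the digit alphabet (0-9 followed by A-V) are my own assumptions; the alphabet is simply the one that reproduces the 2AF8 and ANO values above.

```python
# Digit alphabet for bases up to 32: 0-9, then A-V (an assumption that
# matches the 2AF8 / ANO figures quoted in the post).
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def to_base(n: int, base: int) -> str:
    """Write a non-negative integer n in the given base (2..32)."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError("base must be between 2 and 32")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return DIGITS[0]
    out = []
    while n > 0:
        n, remainder = divmod(n, base)  # peel off the lowest digit
        out.append(DIGITS[remainder])
    return "".join(reversed(out))       # digits were collected low to high


for base in (10, 16, 32):
    print(base, to_base(11000, base))
```

Python's built-in `int("ANO", 32)` reads the same digit alphabet, so it can be used to go back from the code to the integer.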

##### Share on other sites

Thanks.

Khaled, I know that, but I want to avoid composition.

I'm not sure it is possible in the case of alignment.

Look, let's say:

the 11000 values fit into 4^7 = 16384 codes over (g, t, c, a) as in DNA, or into 20^4 = 160000 codes over the amino acid alphabet.

For example, in the case of DNA:

gttcaag -> the code for 100, for example

gttcaaa -> the code for 90, for example

The two codes are nearly alike (the algorithm treats them as almost the same sequence), but in fact the values are quite different.

This is my problem; I'm not sure whether composition is possible or not.

I think this could be discussed in a bioinformatics forum, but someone would move my questions to computer science if I posted there.

Many thanks again, Khaled.

##### Share on other sites

I still don't understand why you want Unicode.

To a computer, Unicode still looks like 0F8A or 0110111000100101. It only comes out as a character after you put it through some kind of renderer.

It is fundamentally no different from an integer, except that string data types and methods in high-level languages usually carry around a lot of baggage that you won't need for a mathematical algorithm.

A sequence alignment algorithm can just as easily be programmed to use atoms consisting of integers as atoms consisting of characters.

Unless you need to render it for some reason and get human input, the best way of doing this is with a lower level data type.

To elaborate on your DNA example (or a 4-bit per character representation of a number or code), you wouldn't have gttcaag as a sequence; it would be an element of the sequence. In this case the sequence would be (using [] to denote different elements in an array):

[gttcaag] [gaagtgc][gttcaaa][gtagatc]

Which would be just as similar to

[gttcaaa] [gaagtgc][gttcaaa][gtagatc]

as it would be to

[aaaaaaa] [gaagtgc][gttcaaa][gtagatc]
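The point about atoms can be checked with a few lines. This is only a sketch: the function name `mismatches` is made up for illustration, and it does a plain position-by-position comparison rather than a full alignment.

```python
def mismatches(a, b):
    """Count positions where two equal-length sequences hold different atoms."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # Each list element is compared as a whole; its inner letters are never inspected.
    return sum(1 for x, y in zip(a, b) if x != y)


ref = ["gttcaag", "gaagtgc", "gttcaaa", "gtagatc"]
near = ["gttcaaa", "gaagtgc", "gttcaaa", "gtagatc"]
far = ["aaaaaaa", "gaagtgc", "gttcaaa", "gtagatc"]

print(mismatches(ref, near), mismatches(ref, far))  # prints "1 1"
```

Both comparisons give one mismatch, because the first element differs as a whole in each case, no matter how many letters inside it happen to agree.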

##### Share on other sites

Unicode does not map all values. If you try to encode your data onto Unicode you will not have a straightforward one-to-one mapping of values.
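As a small illustration of why the mapping is not straightforward (a sketch only, using Python's standard `unicodedata` module; the helper name `unusable_count` and the choice of which categories to reject are my own assumptions), some code points are not characters at all:

```python
import unicodedata

# U+D800..U+DFFF are surrogate code points: they are reserved for UTF-16
# and cannot be written out as UTF-8.
lone_surrogate = chr(0xD800)
print(unicodedata.category(lone_surrogate))  # general category "Cs" (surrogate)

try:
    lone_surrogate.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)

# Low code points are control characters ("Cc"), which do not render as symbols.
print(unicodedata.category(chr(1)))


def unusable_count(lo, hi):
    """Count code points in [lo, hi] that are surrogates, controls, or unassigned."""
    bad = {"Cs", "Cc", "Cn"}
    return sum(1 for cp in range(lo, hi + 1) if unicodedata.category(chr(cp)) in bad)


# The exact count depends on the Unicode version shipped with your Python.
print(unusable_count(1, 11000))
```

So a naive "value n becomes code point n" scheme over 1-11000 already runs into code points that have to be skipped.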

When combined with the human condition, encoding variants and the need to compile manageable code you will have a mess. It is not that it can't be done, it's just not practical for your purposes, even despite it having been used for demonstration.

**** Feel free to disagree with me and the fine Hat above, but you are swimming in the wrong direction. ****

The fact that you are asking how to accomplish the abomination further supports the idea that a mess will result. Integer representation under any base will encode your data and will not break your algorithm, composition or no . . . . . Unicode may encode your data, but your success rate will vary depending on your experience, and this conversation suggests a lack thereof.

I am not saying any of this to be rude or mean. I really do wish you all the best in your endeavor and am therefore giving you what I believe to be the best advice.

Peace to you good stranger,

Beka:D

##### Share on other sites

Huda, I've helped many students. There are people who are looking for a solution to their problem, and there are people who, when offered different solutions, insist on making things worse ...

And not only did you set aside the offered solutions, you also brought unrelated issues into your research. You are analyzing users' behaviors, you have your own data, and you are going to do sequence alignment. Why do you keep bringing DNA and amino acids into your problem, making things worse by trying to encode your data with an amino acid alphabet?

The other problem is that you are confusing encoding your data with how sequence alignment is going to work on the encoded data.

If your data is encoded such that your alphabet is [1, 11000], and if we encode 100 = [gttcaag] and 99 = [gttcaaa], you don't compare every bit of each symbol. You can't compare 'g' to 'a' in the above example, because the whole block is a single symbol! It's an atom that cannot be considered part by part, only as a whole. So we have that

$100 \neq 99 \; \Rightarrow \; [gttcaag] \; \neq \; [gttcaaa]$

Remember, sequence alignment is not tied to DNA, just as the Queens problem is not really about chess ...

Edited by khaled
##### Share on other sites

Thanks for the examples.

You are wondering why I insist on encoding my data instead of leaving it as raw data.

I was intending to reuse the existing algorithms that were designed for biological sequences instead of starting from scratch.

But I think that is not possible because of the large domain of my input.

In addition, I did not expect gttcaag, for example, to be an element; I was expecting it to be part of a sequence, as with DNA or amino acids:

gttcaaggaagtgcgttcaaagtagatc

Thanks for clarifying, and I would like to hear from you if you have more comments.

I welcome any suggestions and advice as long as the intent is to help me.

I hope you read what I wrote to Schrödinger's hat.

Thanks

I think you will know what I was intending to do if you read what I wrote above.

OK, so the coding may not allow me to use the existing algorithms.

I can leave the data raw, but in that case the domain of the input will be very large, and this makes designing the algorithm difficult.

So, you suggested using multi-symbol codes to reduce the domain of the input.

Thanks

##### Share on other sites

No, how the data is represented will never make the algorithm design any harder. You simply represent things differently,

and you can use any algorithm with your encoding system; you just need to build functions that support this encoding system.

You will use the same algorithm, except that when the comparison comes, instead of doing

if ( x == y )

you will define a new function that compares two symbols/blocks in your encoding system, just like this

if ( equal (x, y) )

.. simple enough ?
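To make that concrete, here is a rough sketch of a global (Needleman-Wunsch style) alignment score where the symbol comparison is a pluggable function. The scoring values (match +1, mismatch -1, gap -1), the function names, and the example user sequences are all assumptions for illustration, not anything taken from Huda's data.

```python
def equal(x, y):
    """Hypothetical symbol comparison; replace with whatever your encoding needs."""
    return x == y


def alignment_score(a, b, match=1, mismatch=-1, gap=-1, eq=equal):
    """Global alignment score of two symbol sequences (Needleman-Wunsch recurrence)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if eq(a[i - 1], b[j - 1]) else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Each symbol is a whole integer from the range 1..11000, compared as an atom.
user_a = [100, 7, 90, 4521]
user_b = [100, 7, 99, 4521]
user_c = [100, 90, 4521]

print(alignment_score(user_a, user_b))  # one substitution
print(alignment_score(user_a, user_c))  # one gap
```

The algorithm never looks inside a symbol, so the size of the alphabet does not change the code at all; only `eq` (or a scoring function in its place) knows what a symbol is.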

##### Share on other sites

We're moving from computing into the realm of statistics, but: I did a chapter of my PhD using continuous morphometric data, nucleotide sequence data and categorical allele frequency data, so I have some idea of the statistical tools for comparison of each data type and the assumptions that go along with them.

Basically, you seem to have a string of either continuous integers or a very large suite of categorical states representing individuals. How you treat them will be somewhat dependent on what exactly you're using the data to answer. Are you trying to determine which individuals are more similar to each other? Number of clusters in your data? Where in the sequence individuals differ? Something else? It is really hard to determine the validity of a methodological approach without a clear idea of the actual hypothesis you're testing...

You want to concatenate these data points into a string and then treat them as you would either a string of unlinked genetic loci or as a single linked gene to compare them, right?

Having worked with these different data types, I'm seeing a lot more negatives than positives in trying to adapt your data to sequence alignment software rather than treating it under a more realistic set of assumptions... How are you going to deal with issues like gap open penalties, repetitive regions, etc.?

Then what sort of downstream analysis do you intend to do? Phylogenetic analysis has a large suite of assumptions which it's almost certain you'll be violating many of. (ploidy, model of substitution, minimized deep coalescent events, clock like substitution rates, etc) Population genetic methods are even worse (no unsampled populations, neutral model of evolution, etc)

There's a suite of traditional statistical comparisons which, at least to me, would seem far more appropriate given your data. If you can let me know the hypothesis you're testing I can probably give you some suggestions on approaches to take, e.g. converting categorical data to a dissimilarity matrix, testing for clusters using PCA ordination and Bayesian cluster detection, Euclidean distance tree building, classification tree analysis, discriminant function analysis, distance based redundancy analysis etc.

There are reasons why us biologists don't try and convert other data types into sequence data, and it's almost always because there's a far more appropriate method available to answer the question at hand with the data at hand. A bit more information might help us to provide suggestions...

Edited by Arete
##### Share on other sites

You want to concatenate these data points into a string and then treat them as you would either a string of unlinked genetic loci or as a single linked gene to compare them, right?

Yes, exactly, but later I learned that it is not possible to treat my data as a single linked gene, because I cannot find a unique code for each value. Treating it as unlinked genetic loci may be more suitable, as in:

agh, tre, zca, ...

I'm not sure whether my analysis is right or not.

OK, I will give you more details.

I intend to find the relationships among a set of users.

The data I obtained for this purpose represents the actions of users over time in an online forum.

So I have an array in which each row represents the actions of one user over time.

I'm trying to align their sequences of actions to determine which individuals are more similar to each other, then form clusters and find communities.

I'm trying to find the similarity among users that results from their influence on each other through social relationships in the online community.

If I used the more realistic techniques you mentioned, I could form clusters, but I would not find the relationships.

Thanks

##### Share on other sites

Can you treat user actions as a continuous variable, or are they strictly categorical?

I'd be leaning towards not involving any sequence data coding and using either a distance or dissimilarity matrix and a hierarchical clustering approach e.g. http://www.statmethods.net/advstats/cluster.html

That way you don't run into all sorts of issues regarding erroneous assumptions you will have to make by trying to treat your data as sequence data.
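A minimal sketch of that route, assuming edit distance as the dissimilarity between two users' action sequences and naive single-linkage merging (both are my choices for illustration; the linked page uses R, and the user data below is invented):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of categorical symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or keep
        prev = curr
    return prev[-1]


def dissimilarity_matrix(seqs):
    """Pairwise edit distances between all sequences."""
    n = len(seqs)
    return [[edit_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]


def single_linkage(dist, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] += clusters.pop(q)  # q > p, so index p is still valid
    return clusters


# Hypothetical users; each action is an integer code in 1..11000.
users = {
    "u1": [100, 7, 90, 4521],
    "u2": [100, 7, 99, 4521],
    "u3": [3, 3, 8800, 12],
    "u4": [3, 8800, 12],
}
names = list(users)
dist = dissimilarity_matrix([users[n] for n in names])
groups = single_linkage(dist, k=2)
print([[names[i] for i in g] for g in groups])  # [['u1', 'u2'], ['u3', 'u4']]
```

For real data a library implementation of hierarchical clustering would be the sensible choice; the point here is only that the integer codes can be used directly, with no sequence-data recoding.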

##### Share on other sites

my sequence is discrete

thanks

##### Share on other sites

• 2 weeks later...

Hi,

I have passed the encoding stage, and the discussion here was very handy for me.

Now, please, I would like to know:

in the case where the sequences are not biological, is a unitary scoring matrix suitable, or do I have to design a new scoring matrix?

Thanks

##### Share on other sites

What downstream analyses do you intend to implement?

You can't really implement any rigorous phylogenetic or popgen methods without violating a whole suite of assumptions.

Why have you dismissed using the data in its existing form with a suite of existing multivariate methods?

It seems like you're re-inventing the wheel in the shape of a square here...

Edited by Arete
