
File Format Problem


herme3


I don't really understand the problem described in this article: http://news.bbc.co.uk/2/hi/technology/6265976.stm

 

They seem to be saying that the contents of a document will be lost if nobody is able to find a copy of the software that was used to create the document. In all word processing programs I'm aware of, the text itself is stored as plain text within the file unless you encrypt it. You can open an old Word document in Notepad and still retrieve the text. You would have to set the font and style information again, but why is it such a large problem?


They seem to be saying that the contents of a document will be lost if nobody is able to find a copy of the software that was used to create the document. In all word processing programs I'm aware of, the text itself is stored as plain text within the file unless you encrypt it. You can open an old Word document in Notepad and still retrieve the text. You would have to set the font and style information again, but why is it such a large problem?

 

A lot of the time it's a lot harder than that. Suppose you have government records stored in Word files and you can't use Office. How do you retrieve that many files? It would be a monumental effort to apply the correct formatting to hundreds of documents, and if features like "Track Changes" have been used, it would be even harder.

 

Picking a format you know you'll be able to read fifty years from now, even if it takes writing a piece of software for your shiny new quantum computers, is a lot nicer. Store files in the OpenDocument format and even if all of the OpenOffice and OASIS people are dead and the software long-gone, you'll be able to dig up a copy of the standard and write a program to extract all of the text.


How hard could it be to write up a script to convert them all to a different format (say .odt)? All you would have to do is run it every time there is a major format change, and you would still be able to access the old files.

 

If you have hundreds of terabytes of data, like the article describes, it's not that easy. And this is a government agency, too, so inefficiency is at work.
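
For a sense of what such a conversion script involves, here is a minimal sketch in Python. It assumes LibreOffice is installed and its `soffice` binary is on the PATH; the function names and the `dry_run` switch are made up for illustration. Writing the loop is the easy part. Running it over terabytes and checking that nothing was mangled is where the effort goes.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src: Path, out_dir: Path, target: str = "odt") -> list[str]:
    """Build a headless LibreOffice command that converts one file.

    Assumption: LibreOffice's `soffice` is installed and on the PATH.
    """
    return [
        "soffice", "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_tree(root: Path, out_dir: Path, dry_run: bool = True) -> list[list[str]]:
    """Find every .doc file under root and convert it.

    With dry_run=True (the default) nothing is executed; the commands
    that would be run are only returned for inspection.
    """
    commands = [build_convert_command(p, out_dir) for p in sorted(root.rglob("*.doc"))]
    if not dry_run:
        if shutil.which("soffice") is None:
            raise RuntimeError("soffice not found on PATH")
        out_dir.mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            # check=True stops the batch at the first failed conversion.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `convert_tree(Path("archive"), Path("converted"))` only lists the planned commands; pass `dry_run=False` to actually run them. Both directory names are placeholders.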


If they're only worried about preserving the information itself, they should just archive it as plain old ASCII text. It's not like vi is going to become unavailable.


I don't really understand the problem described in this article: http://news.bbc.co.uk/2/hi/technology/6265976.stm

 

They seem to be saying that the contents of a document will be lost if nobody is able to find a copy of the software that was used to create the document. In all word processing programs I'm aware of, the text itself is stored as plain text within the file unless you encrypt it. You can open an old Word document in Notepad and still retrieve the text. You would have to set the font and style information again, but why is it such a large problem?

 

Try opening one of these in Notepad and you might see what the problem is. Note: The visible contents of the two files are identical.

http://www.caam.rice.edu/~caam452/CAAM452Lecture4b.ppt

http://www.caam.rice.edu/~caam452/CAAM452Lecture4b.ppf


A lot of the time it's a lot harder than that. Suppose you have government records stored in Word files and you can't use Office. How do you retrieve that many files? It would be a monumental effort to apply the correct formatting to hundreds of documents, and if features like "Track Changes" have been used, it would be even harder.

 

Yes, reformatting a large number of documents might be difficult. However, I wouldn't consider any knowledge to be lost if the text itself is kept in plain text. You could just scroll past the unreadable code, and copy-paste the plain text contained within the document.
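
The "scroll past the unreadable code" step can be automated, roughly the way the Unix `strings` tool does it. Below is a minimal sketch in Python; the function name and the minimum run length of 4 are arbitrary choices for illustration. It only finds ASCII-range characters, stored either one byte each or as UTF-16LE, so it will miss accented or non-Latin text, and it finds nothing useful in a compressed file.

```python
import re


def printable_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII characters found in raw bytes.

    Looks for two encodings: one byte per character, and UTF-16LE
    (each ASCII character followed by a zero byte).
    """
    # 8-bit text: min_len or more printable ASCII bytes in a row.
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # UTF-16LE text: printable ASCII bytes, each followed by a zero byte.
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    runs = [m.group().decode("ascii") for m in ascii_pat.finditer(data)]
    runs += [m.group().decode("utf-16-le") for m in utf16_pat.finditer(data)]
    return runs
```

To try it on a real file, something like `printable_runs(open("report.doc", "rb").read())` would do, where `report.doc` is a placeholder name. Formatting, tables, and tracked changes are all lost, which is the point the other posters are making.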

 

Picking a format you know you'll be able to read fifty years from now, even if it takes writing a piece of software for your shiny new quantum computers, is a lot nicer. Store files in the OpenDocument format and even if all of the OpenOffice and OASIS people are dead and the software long-gone, you'll be able to dig up a copy of the standard and write a program to extract all of the text.

 

This is very strange. I just compared the raw text of a Word document and an OpenOffice document in Notepad. The Word document had a little formatting data at the top, but the rest of the text was easy to read. In the OpenOffice document, the whole document appears to be written in unreadable code. Why is the document encrypted if I didn't choose to encrypt it?


Yes, reformatting a large number of documents might be difficult. However, I wouldn't consider any knowledge to be lost if the text itself is kept in plain text. You could just scroll past the unreadable code, and copy-paste the plain text contained within the document.

And if you have terabytes of government data, do you really want to spend the time to copy/paste it all?

 

This is very strange. I just compared the raw text of a Word document and an OpenOffice document in Notepad. The Word document had a little formatting data at the top, but the rest of the text was easy to read. In the OpenOffice document, the whole document appears to be written in unreadable code. Why is the document encrypted if I didn't choose to encrypt it?

OpenOffice documents are actually several files zipped together. Rename the file to file.zip and open it with a decompressor program. The same is true for Office 2007 documents.


And if you have terabytes of government data, do you really want to spend the time to copy/paste it all?

 

It could be done whenever the information is necessary. There won't be a time when every terabyte of data needs to be opened at once. When a certain document is needed, just open it in Notepad instead of the original application. From there, the data can be copied into an open format.

 

OpenOffice documents are actually several files zipped together. Rename the file to file.zip and open it with a decompressor program. The same is true for Office 2007 documents.

 

I wonder why the documents are zipped? Unless you are saving a major server's log files or other large files, compressing each text document shouldn't make much of a difference on modern hard drives and flash memory devices. Wouldn't it be better to reduce the open/save time by not compressing them?


I wonder why the documents are zipped? Unless you are saving a major server's log files or other large files, compressing each text document shouldn't make much of a difference on modern hard drives and flash memory devices. Wouldn't it be better to reduce the open/save time by not compressing them?

 

Zipped as in archived: it keeps the files together as one document, which makes them easier to handle. Also, any compression used will not have a noticeable effect on speed, as today's processors are generally multi-core and extremely fast.
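
The distinction between archiving and compressing can be seen with Python's `zipfile` module, which can store files in a zip container with or without compression. This is only a sketch: the helper name is made up, and the repetitive XML-like payload compresses far better than a real document would.

```python
import io
import zipfile


def zipped_size(payload: bytes, method: int) -> int:
    """Return the size in bytes of a zip archive holding one member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("content.xml", payload)
    return len(buf.getvalue())


# Artificial, highly repetitive stand-in for a document's XML.
xml = b"<text:p>Hello world</text:p>\n" * 2000

stored = zipped_size(xml, zipfile.ZIP_STORED)      # container only, no compression
deflated = zipped_size(xml, zipfile.ZIP_DEFLATED)  # container plus compression

print(f"stored: {stored} bytes, deflated: {deflated} bytes")
```

Either way the result is a single file that is easy to move around; the compression is a bonus on top of the packaging.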


It could be done whenever the information is necessary. There won't be a time when every terabyte of data needs to be opened at once.

 

Unless you're searching through the text for a key phrase ;)

 

Or unless you want casual access, without having to copy/paste before you can read it.


I wonder why the documents are zipped? Unless you are saving a major server's log files or other large files, compressing each text document shouldn't make much of a difference on modern hard drives and flash memory devices. Wouldn't it be better to reduce the open/save time by not compressing them?

 

It's because an OpenDocument file (and an OpenXML file) consists of more than one file. There's usually a file with the text structure, one with a style sheet, and a folder containing any images used in the document. To make it easy to handle, they're all zipped up into one file so you don't have to deal with dozens of files sitting around for all three documents you have.
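
To make that layout concrete, here is a minimal sketch in Python that builds a toy stand-in for an .odt file in memory and then pulls the paragraph text back out of its `content.xml`. The toy archive is not a fully valid OpenDocument file (it has no manifest, for one), and the function names are invented for illustration. The two namespace URIs are the ones the OpenDocument standard uses.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def make_toy_odt() -> bytes:
    """Build a tiny stand-in for an .odt file (not a fully valid one)."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        "<text:p>Hello</text:p><text:p>world</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", content)      # the text structure
        zf.writestr("styles.xml", "<styles/>")   # placeholder style sheet
    return buf.getvalue()


def paragraphs(odt_bytes: bytes) -> list[str]:
    """Extract the text of every text:p element from content.xml."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


print(paragraphs(make_toy_odt()))
```

The same `paragraphs` function should work on the bytes of a real .odt file, though real documents also keep text in headings, lists, and tables that this sketch ignores.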


There are ways of doing this, and I have the solution for you, it's simple:

 

Convert data to instructions for rebuilding. Like this:

 

Type:

Hello

Make font 1 cm high, and red:

world

end document.

 

Then we need something to separate commands from text in a clear way. I know, let's use angular brackets:

 

<text>Hello<yey high, red>World

 

Great, let's refine a little. Text is already there, and color needs to be more flexible, so

 

hello<font color=#FF0000 size=4>world</font>

 

Cool. Let's make something like this. Even if nobody can open it, someone can easily write a program that parses this and saves in any format I want.

 

I'm a genius.

 

It's so simple it can be even used in jokes. Look:

 

</sarcasm>
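
Joking aside, the format above is close enough to HTML that Python's standard `html.parser` module reads it as-is. The sketch below is one way to separate the text from the formatting commands; the class and function names are invented for illustration.

```python
from html.parser import HTMLParser


class TextAndFonts(HTMLParser):
    """Collect plain text and the attributes of every <font> tag."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.fonts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "font":
            self.fonts.append(dict(attrs))

    def handle_data(self, data):
        self.text.append(data)


def parse(markup: str):
    """Return (plain_text, list_of_font_attribute_dicts)."""
    parser = TextAndFonts()
    parser.feed(markup)
    parser.close()
    return "".join(parser.text), parser.fonts


text, fonts = parse("hello<font color=#FF0000 size=4>world</font>")
print(text)   # the words with the markup stripped out
print(fonts)  # the formatting instructions, kept separately
```

From there the text and the formatting could be written out in whatever format is wanted, which is the argument for keeping the markup readable in the first place.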

