RyanJ

File Type Identification


Hey there guys!

 

I have a question for anyone. Maybe some code already exists for this or a good algorithm is already written down somewhere.

 

Basically, I'm trying to build an identifier for unknown file types. I've been looking for a .NET implementation of a file-type matching algorithm, or a good description of an implementation in another language, but I can't seem to find one.

 

If there are none, can anyone point me in the general direction of writing one? I'm also looking for help with the project, so if anyone else is interested in this, let me know.

 

Cheers!


I don't know, but file types have an extension (.xxx), and their contents start with a unique code [ CODE ... ]


Probably one that takes the characters after the last period in the name (there can be more than one period, but only the last one matters, with exceptions such as .tar.gz), then goes to a site and enters that extension as a search, then reformats the result page by removing the excess info (not the best idea if the website constantly redesigns itself...).
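
For illustration, the "characters after the last period" step could be sketched like this. It is in Python for brevity rather than .NET, and `COMPOUND_EXTENSIONS` and `file_extension` are made-up names for this example; the list of compound extensions is an assumption and far from complete.

```python
from pathlib import PurePath

# Hypothetical, incomplete list of compound extensions to keep whole.
COMPOUND_EXTENSIONS = (".tar.gz", ".tar.bz2", ".tar.xz")


def file_extension(name: str) -> str:
    """Return the lowercase extension, keeping known compound ones intact."""
    lowered = name.lower()
    for compound in COMPOUND_EXTENSIONS:
        if lowered.endswith(compound):
            return compound
    # Otherwise only the part after the last period counts.
    return PurePath(lowered).suffix
```

Names without a period (such as README) give an empty string, which is one more reason an extension lookup alone cannot identify every file.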


I think Ryan is talking about identifying the file type by looking at the file's contents rather than at the file's name.

 

The Unix file tool does a semi-decent job of doing just that.


That's the idea, though that tool relies on a trick called magic numbers, which misses most file types these days.
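
For anyone unfamiliar with magic numbers, here is a minimal sketch of the idea: compare bytes at known offsets against a table of signatures. It is written in Python for brevity, not as a .NET implementation. The table holds only a handful of well-known signatures, and `SIGNATURES`, `identify` and `identify_file` are names made up for this example, not part of the file tool.

```python
from typing import Optional

# (offset, signature bytes, label). A tiny illustrative table, not a full database.
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"%PDF-", "PDF document"),
    (0, b"PK\x03\x04", "ZIP archive (also used by DOCX, XLSX, JAR)"),
    (0, b"\x1f\x8b", "gzip data"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"\xff\xd8\xff", "JPEG image"),
    (257, b"ustar", "tar archive"),
]


def identify(data: bytes) -> Optional[str]:
    """Return the label of the first matching signature, or None."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path: str, probe_size: int = 512) -> Optional[str]:
    """Read only the first probe_size bytes of the file and identify them."""
    with open(path, "rb") as fh:
        return identify(fh.read(probe_size))
```

Anything without a table entry, or without a fixed signature at all, comes back as `None`, which is the gap being discussed here.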


I know. It uses a pattern-matching approach. However, it also fails with files that have no structure, such as ISO files.


You mean something like this? http://filext.com/

 

I think there may be some trouble writing a program that can determine the exact application the file is connected to since file extension nomenclature is not strictly policed. As such you will have file types with the same names that are actually completely different formats.

 

To really determine the root application, you will need to get in and read the files themselves and match them against a structure database, in the same way a virus scanner sniffs out infected files.


I know that. Actually that's what I'm working on at the moment :)

 

I'm thinking of using pattern matching, and possibly techniques such as entropy and compressibility in cases where the patterns are not clear-cut. The program is being written in C# and WPF, so if anyone wants to take a look and work on this with me, feel free to let me know :)
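
As a rough sketch of what the entropy and compressibility measurements could look like (in Python for brevity rather than the C# the project uses; `shannon_entropy` and `compression_ratio` are names made up for this example, and zlib is just one possible compressor):

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, up to 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; values near 1.0 mean barely compressible."""
    if not data:
        return 1.0
    return len(zlib.compress(data, 9)) / len(data)
```

Data that is already compressed or encrypted tends to score close to 8 bits per byte and barely shrinks, while plain text scores noticeably lower and compresses well. Any cut-off values for telling types apart would have to be tuned on real sample files.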
