
File Type Identification


RyanJ


Hey there guys!

 

I have a question for anyone. Maybe some code already exists for this or a good algorithm is already written down somewhere.

 

Basically, what I'm trying to do is build an identifier for unknown file types. I'm looking for any .NET implementation of a file-type matching algorithm, but I can't seem to find one. A good description of an implementation in another language would help too.

 

If there are none then can anyone point me in the general direction of writing one? I'm looking for help with the project so if anyone else is interested in this let me know.

 

Cheers!


Probably one that takes the characters after the last period in the name (there can be more than one period, but only the last one matters, apart from exceptions such as .tar.gz), then goes to a site and enters that as a search, then reformats the result page by removing the excess info (not the best idea if the website constantly redesigns itself...).
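
For what it's worth, the extension-extraction step described above could look roughly like this in Python. It is only a sketch: the function name `get_extension` and the list of compound extensions are made up for illustration, and the website lookup step is left out entirely.

```python
import os

# Hypothetical list of compound extensions that should be kept whole.
COMPOUND_EXTENSIONS = (".tar.gz", ".tar.bz2", ".tar.xz")


def get_extension(filename: str) -> str:
    """Return the lowercase extension without its leading dot, or '' if none."""
    name = os.path.basename(filename).lower()

    # Handle the exceptions first, e.g. "backup.tar.gz" -> "tar.gz".
    for compound in COMPOUND_EXTENSIONS:
        if name.endswith(compound) and len(name) > len(compound):
            return compound[1:]

    # os.path.splitext splits at the last period; names that only start
    # with a dot (such as ".bashrc") are treated as having no extension.
    return os.path.splitext(name)[1].lstrip(".")
```

So `get_extension("photo.final.JPG")` gives `"jpg"`, while `get_extension("backup.tar.gz")` keeps the whole `"tar.gz"`.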


I think Ryan is talking about identifying the file type by looking at the file's contents rather than at the file's name.

 

The Unix file tool does a semi-decent job of doing just that.


That's the idea. Though that tool relies on a trick called magic numbers, which misses most file types these days.
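
For anyone who hasn't run into magic numbers before: the trick is to compare the first few bytes of a file against a table of known signatures. Here is a minimal Python sketch of the idea. The table is a tiny hand-picked sample and the name `identify_by_magic` is made up for this post; it is not how the `file` tool itself is implemented.

```python
from typing import Optional

# A small sample of well-known leading signatures (far from complete).
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP container"),
    (b"\x1f\x8b", "gzip data"),
]


def identify_by_magic(data: bytes) -> Optional[str]:
    """Return a label if the data starts with a known signature, else None."""
    for signature, label in MAGIC_NUMBERS:
        if data.startswith(signature):
            return label
    return None


def identify_file(path: str) -> Optional[str]:
    """Read only the first bytes of a file and check them against the table."""
    with open(path, "rb") as handle:
        return identify_by_magic(handle.read(16))
```

The weakness is visible right in the table: a format with no fixed leading bytes (plain text, many raw data dumps) simply returns `None`, and a container signature like ZIP says nothing about what is inside it.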


You mean something like this?: http://filext.com/

 

I think there may be some trouble writing a program that can determine the exact application a file is connected to, since file extension nomenclature is not strictly policed. As such, you will have files with the same extension that are actually completely different formats.

 

To really determine the root application you will need to get in and read the files themselves and match them against a structure database, in the same way a virus scanner sniffs out infected files.
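
A very rough Python sketch of what matching against a structure database might look like: each entry lists byte patterns that must appear at specific offsets, not just at the start of the file. The `Rule` type, the `RULES` table and `match_structure` are all invented for this example, and a real database would need far more entries and richer checks.

```python
from typing import NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    label: str
    # Each check is (offset, expected bytes); all checks must match.
    checks: Tuple[Tuple[int, bytes], ...]


# Illustrative entries only.
RULES = [
    Rule("WAV audio", ((0, b"RIFF"), (8, b"WAVE"))),
    Rule("AVI video", ((0, b"RIFF"), (8, b"AVI "))),
    Rule("tar archive", ((257, b"ustar"),)),
]


def match_structure(data: bytes) -> Optional[str]:
    """Return the label of the first rule whose checks all match."""
    for rule in RULES:
        if all(data[offset:offset + len(expected)] == expected
               for offset, expected in rule.checks):
            return rule.label
    return None
```

Checking several offsets is what lets two formats that share the same first four bytes (WAV and AVI are both RIFF files) be told apart.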


I know that. Actually that's what I'm working on at the moment :)

 

I'm thinking of using pattern matching, and possibly techniques such as entropy and compressibility in cases where the patterns are not clear cut. The program is being written in C# and WPF, so if anyone wants to take a look and work on this with me, feel free to let me know :)
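
To make the entropy and compressibility idea concrete, here is a small sketch. It is in Python rather than C# purely for brevity, and the function names are placeholders, not part of the actual project. Entropy is the Shannon entropy of the byte histogram (0 to 8 bits per byte), and compressibility is approximated by how much zlib shrinks the data.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means more compressible."""
    if not data:
        return 1.0
    return len(zlib.compress(data, 9)) / len(data)
```

Plain text tends to score well below 8 bits per byte and compresses well, while already compressed or encrypted data sits close to 8 and barely shrinks. That gives a coarse class of file rather than an exact type, so it would only be a fallback for when no pattern matches.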
