5

There is an http.DetectContentType([]byte) function in the net/http package, but it supports only a limited number of types. How can I add support for docx, doc, xls, xlsx, ppt, pps, odt, ods and odp files, detecting them by content rather than by extension? As far as I know, there are some problems, because docx/xlsx/pptx/odp/odt files have the same signature as a zip file (50 4B 03 04).

2
  • 1
    golang.org/pkg/mime Commented Apr 24, 2015 at 3:27
  • 1
    @SalvadorDali The mime package is useful, but the question specifically asks about detection based on content, not extension. Commented Apr 24, 2015 at 4:09

4 Answers 4

8

Disclaimer: I'm the author of mimetype.

For anyone having the same problem 3 years later, nowadays the packages for mime type detection based on the content are the following:

  • filetype

    • pure go, no c bindings
    • can be extended to detect new mime types
    • has issues with files that match more than one mime type (e.g. xlsx and docx also match zip) because it stores its matching functions in a map, so the order of traversal is not guaranteed
    • limited number of detected mime types
  • magicmime

    • needs libmagic-dev installed
    • of the three, it has the highest number of detected mime types
    • can be extended, albeit with more effort (see man magic)
    • libmagic is not thread safe
  • mimetype

    • pure go, no c bindings
    • higher number of detected mime types than filetype
    • is thread safe
    • can be extended

3

Formats whose extension ends in x (docx, xlsx, pptx) are relatively easy to detect. Unzip the file and read the _rels/.rels entry. It contains the path to the main part of the document, identified by the relationship type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument. Check the file name of that target: it's document.xml for docx, workbook.xml for xlsx and presentation.xml for pptx.

More info can be found in ECMA-376.
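
A minimal sketch of that check using only the standard library might look like the following. It assumes the relationships part sits at _rels/.rels inside the zip and that the file name of the officeDocument target is enough to tell the formats apart. The names detectOOXML and samplePackage are made up for this example, and samplePackage only builds a tiny in-memory fixture so the program runs without a real document.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
	"path"
)

const officeDocRel = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"

// relationships mirrors the parts of _rels/.rels that matter here.
type relationships struct {
	Items []struct {
		Type   string `xml:"Type,attr"`
		Target string `xml:"Target,attr"`
	} `xml:"Relationship"`
}

// detectOOXML (hypothetical helper) opens data as a zip archive, reads
// _rels/.rels and maps the officeDocument target to a file type.
func detectOOXML(data []byte) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "", err
	}
	for _, f := range zr.File {
		if f.Name != "_rels/.rels" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		defer rc.Close()
		var rels relationships
		if err := xml.NewDecoder(rc).Decode(&rels); err != nil {
			return "", err
		}
		for _, r := range rels.Items {
			if r.Type != officeDocRel {
				continue
			}
			switch path.Base(r.Target) {
			case "document.xml":
				return "docx", nil
			case "workbook.xml":
				return "xlsx", nil
			case "presentation.xml":
				return "pptx", nil
			}
		}
	}
	return "", errors.New("no known officeDocument relationship found")
}

// samplePackage builds an in-memory zip whose _rels/.rels points at target.
// It is a demo fixture only, not a valid Office document.
func samplePackage(target string) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, _ := zw.Create("_rels/.rels")
	fmt.Fprintf(w, `<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">`+
		`<Relationship Id="rId1" Type="%s" Target="%s"/></Relationships>`, officeDocRel, target)
	zw.Close()
	return buf.Bytes()
}

func main() {
	fmt.Println(detectOOXML(samplePackage("word/document.xml")))
}
```

Because the whole archive has to be readable, this needs the complete file rather than just the first few hundred bytes that a header sniffer looks at.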

Binary formats are harder to detect. Basically, you need to read the MS-CFB filesystem and check for these entries:

  • WordDocument for doc
  • Workbook or Book for xls
  • PowerPoint Document for ppt
  • EncryptedPackage means file is encrypted.
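
As a rough starting point, the entry check above can be sketched with the standard library alone. This is a simplified reader, not a full MS-CFB implementation: it assumes the whole file is in memory, follows the directory sector chain using only the FAT sectors listed in the header, and returns every used entry name regardless of which storage it belongs to. The names cfbEntryNames and classifyCFB are invented for this example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"os"
	"unicode/utf16"
)

var cfbMagic = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}
var le = binary.LittleEndian

const endOfChain = 0xFFFFFFFE // FAT value that terminates a sector chain

// cfbEntryNames (hypothetical helper) walks the directory sector chain of a
// compound file and returns the names of all used directory entries.
func cfbEntryNames(data []byte) ([]string, error) {
	if len(data) < 512 || !bytes.HasPrefix(data, cfbMagic) {
		return nil, errors.New("not a compound file")
	}
	shift := le.Uint16(data[30:])
	if shift != 9 && shift != 12 {
		return nil, errors.New("unexpected sector shift")
	}
	size := int64(1) << shift
	sector := func(n uint32) []byte { // nil when sector n is out of range
		start := (int64(n) + 1) * size
		if start+size > int64(len(data)) {
			return nil
		}
		return data[start : start+size]
	}
	next := func(n uint32) uint32 { // FAT lookup via the header DIFAT only
		per := uint32(size / 4)
		if n/per >= 109 {
			return endOfChain
		}
		fat := sector(le.Uint32(data[76+4*(n/per):]))
		if fat == nil {
			return endOfChain
		}
		return le.Uint32(fat[4*(n%per):])
	}
	var names []string
	cur := le.Uint32(data[48:]) // first directory sector
	for steps := 0; cur <= 0xFFFFFFFA && steps < 4096; steps++ {
		s := sector(cur)
		if s == nil {
			break
		}
		for off := 0; off+128 <= len(s); off += 128 {
			e := s[off : off+128]
			n := int(le.Uint16(e[64:])) // name length in bytes, incl. terminator
			if e[66] == 0 || n < 2 || n > 64 {
				continue // unused or malformed entry
			}
			u := make([]uint16, n/2-1)
			for i := range u {
				u[i] = le.Uint16(e[2*i:])
			}
			names = append(names, string(utf16.Decode(u)))
		}
		cur = next(cur)
	}
	return names, nil
}

// classifyCFB maps the first recognised entry name to a file type.
func classifyCFB(names []string) string {
	kinds := map[string]string{
		"WordDocument": "doc", "Workbook": "xls", "Book": "xls",
		"PowerPoint Document": "ppt", "EncryptedPackage": "encrypted",
	}
	for _, n := range names {
		if k, ok := kinds[n]; ok {
			return k
		}
	}
	return "unknown"
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pass the path of a .doc, .xls or .ppt file")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println(err)
		return
	}
	names, err := cfbEntryNames(data)
	fmt.Println(classifyCFB(names), names, err)
}
```

The step limit in the loop is only there to stop a corrupted FAT from causing an endless walk. A production reader would also need to handle DIFAT sectors beyond the header.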

1 Comment

any tips on how to get started with this?
1

There's currently no way to extend http.DetectContentType as it uses a fixed, unexported slice of "sniffers": https://golang.org/src/net/http/sniff.go (sniffSignatures on line 49 at the time of writing).

Also, I looked quickly through godoc.org in search of a better package but didn't find any that is extensible and content-oriented as you require.

My advice would be: build your own package, guided by Go's content sniffer implementation (which follows https://mimesniff.spec.whatwg.org/).
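
To illustrate the idea, here is a minimal sketch of such a package: a small table of extra signatures that is checked first, with http.DetectContentType as the fallback. The signature type, the extraSignatures table and detect are names invented for this example. The OpenDocument entries rely on the assumption that the archive starts with an uncompressed "mimetype" entry, so the type string appears at byte offset 30. "application/x-ole-storage" is an unofficial label chosen here for the compound file container.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// signature is a byte pattern expected at a fixed offset (hypothetical type).
type signature struct {
	offset int
	magic  []byte
	mime   string
}

// extraSignatures is checked in order before falling back to net/http.
// Note that the .text entry is a prefix match, so it would also match
// related types such as text templates; a fuller table would list the
// longer names first.
var extraSignatures = []signature{
	{0, []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}, "application/x-ole-storage"},
	{30, []byte("mimetypeapplication/vnd.oasis.opendocument.spreadsheet"), "application/vnd.oasis.opendocument.spreadsheet"},
	{30, []byte("mimetypeapplication/vnd.oasis.opendocument.presentation"), "application/vnd.oasis.opendocument.presentation"},
	{30, []byte("mimetypeapplication/vnd.oasis.opendocument.text"), "application/vnd.oasis.opendocument.text"},
}

// detect tries the extra signatures first, then http.DetectContentType.
func detect(data []byte) string {
	for _, s := range extraSignatures {
		if len(data) >= s.offset && bytes.HasPrefix(data[s.offset:], s.magic) {
			return s.mime
		}
	}
	return http.DetectContentType(data)
}

func main() {
	// A 30-byte zip local file header followed by the ODF mimetype entry.
	odt := append([]byte("PK\x03\x04"), make([]byte, 26)...)
	odt = append(odt, "mimetypeapplication/vnd.oasis.opendocument.text"...)
	fmt.Println(detect(odt))
	fmt.Println(detect([]byte("PK\x03\x04")))
}
```

Using an ordered slice rather than a map keeps the matching order deterministic, which matters when one signature is a prefix of another.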

Edit: If you're willing to use cgo and you're on *nix, you could use libmagic bindings such as https://github.com/jteeuwen/magic.


1

I found mimemagic, which I find preferable to magicmime since it doesn't use cgo. But magicmime is better at differentiating between application/zip and office file types.
