Doc xls ppt support#223

Closed
leizhao175 wants to merge 2 commits into sindresorhus:master from leizhao175:doc-xls-ppt-support

@leizhao175

Hi,

I have extended your library to support the legacy .doc, .xls, and .ppt formats (Office 97 to 2003).

The logic is based on http://fileformats.archiveteam.org/wiki/Microsoft_Compound_File.

As you may know, identifying these file types basically requires loading the whole file into the chunk, so I had to make a small change to the interface. I tried my best to keep the impact close to zero.

Please let me know if you like it, or if I missed anything.

Here are my modifications:

fileType(chunk)

  • If the chunk is not an 'msi' file, the response is unchanged.
  • If the chunk is a doc, xls, or ppt file and is large enough to identify the type, it returns one of:
    {ext: 'doc', mime: 'application/msword'}
    {ext: 'xls', mime: 'application/vnd.ms-excel'}
    {ext: 'ppt', mime: 'application/vnd.ms-powerpoint'}
  • If the chunk is a doc, xls, or ppt file but is not large enough to identify the type, it returns:
    {ext: 'msi', mime: 'application/x-msi', minimumRequiredBytes: 1234567}
    This is almost the same as the existing response, plus an extra field, 'minimumRequiredBytes', which tells the user how large a chunk is needed to identify the file. The user can then re-invoke the function with a larger chunk if desired.
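
The retry flow described above could look roughly like the sketch below. It is illustrative only: `minimumRequiredBytes` exists only in this PR's proposal and is not part of the published file-type API, and `ProposedResult`, `Detect`, and `detectWithRetry` are hypothetical names. The `detect` callback stands in for the patched fileType(), and the default of 4100 initial bytes is an assumption.

```typescript
// Hypothetical result shape from this PR's proposal; `minimumRequiredBytes`
// is NOT part of the published file-type API.
interface ProposedResult {
  ext: string;
  mime: string;
  minimumRequiredBytes?: number;
}

// Stand-in for the patched fileType(chunk) function.
type Detect = (chunk: Uint8Array) => ProposedResult | undefined;

// Call `detect` on a prefix of `file`, and call it again with a larger
// prefix whenever the result asks for more bytes.
function detectWithRetry(
  file: Uint8Array,
  detect: Detect,
  initialBytes = 4100 // assumed default size of the first chunk
): ProposedResult | undefined {
  let size = Math.min(initialBytes, file.length);
  for (;;) {
    const result = detect(file.subarray(0, size));
    const needed = result?.minimumRequiredBytes;
    // Stop when no more data is requested, or when the file cannot supply it.
    if (needed === undefined || needed <= size || size >= file.length) {
      return result;
    }
    size = Math.min(needed, file.length);
  }
}
```

If the file is shorter than the requested size, the loop returns the last result (the 'msi' fallback) instead of retrying forever.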

fileType.stream()

  • I have modified this function so that it automatically loads an extra chunk from the stream when required.
  • It still returns a PassThrough stream with the extra fileType field, as before.
  • If the stream is a .doc, .xls, or .ppt file, the function sets fileType to the matching type listed above.
  • No impact on users: the function still accepts and returns a stream.

testFile() & testFileFromStream()

  • I had to slightly modify these two functions in test.js so that they recognise the new 'minimumRequiredBytes' field and provide a larger chunk for testing.

Impact to users:

  • Users of the stream function are not affected.
  • Users of the chunk function who don't care about '.doc | .xls | .ppt' are not affected either: they get the same result as before plus the extra 'minimumRequiredBytes' field, which they can simply ignore.
  • Users of the chunk function who do care about '.doc | .xls | .ppt' must be new users, since this detection was not available before. And we can help them now :)

Zhao, Lei (AU - Sydney) and others added 2 commits June 30, 2019 12:37
2, Modify `fileType.stream()` so it can read from the stream twice. The first time it tries to read a small chunk (minimumBytes) for fileType(). If it's a doc / xls / ppt file, it reads again with a larger chunk for fileType()
…dard does not support promise().then()

Add unit test cases
@sindresorhus
Owner

While I appreciate the PR, I'm not really interested in supporting these ancient formats. Especially not when they cause lots of churn to the code base. I would recommend making a separate module that only checks for those formats and could be used in combination with file-type, or a fork of file-type.

@ajayvignesh01

For anyone stumbling upon this: I made a custom detector that you can use with this library to detect legacy doc, xls, and ppt files.

import { FileTypeParser, type Detector } from 'file-type'

// Usage (run this after `legacyOfficeDetector`, defined below, has been initialized)
const parser = new FileTypeParser({ customDetectors: [legacyOfficeDetector] })
const fileType = await parser.fromBuffer(...)

// Custom detector for legacy Microsoft Office files (.doc, .xls, .ppt)
// MS-CFB (Compound File Binary Format) structure:
// - Header: first 512 bytes (we only need first 52 bytes for key fields)
// - Directory stream: contains root entry with CLSID that identifies file type
// Ref: https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb
const legacyOfficeDetector: Detector = {
  id: 'legacy-office',
  async detect(tokenizer) {
    // CFBF signature: D0 CF 11 E0 A1 B1 1A E1
    const cfbfSignature = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]

    // Read minimum header bytes needed (52 bytes covers signature + sector shift + directory sector location)
    const headerBuffer = new Uint8Array(52)
    const headerBytesRead = await tokenizer.peekBuffer(headerBuffer, {
      length: 52,
      mayBeLess: true
    })

    // Need at least 52 bytes to read required header fields
    if (headerBytesRead < 52) {
      return undefined
    }

    // Check CFBF signature
    if (!cfbfSignature.every((value, index) => value === headerBuffer[index])) {
      return undefined
    }

    // Validate sector shift (offset 30-31, little-endian unsigned 16-bit): must be 9 (512-byte sectors) or 12 (4096-byte sectors)
    const sectorShift = headerBuffer[30] | (headerBuffer[31] << 8)
    if (sectorShift !== 9 && sectorShift !== 12) {
      return undefined
    }
    const sectorSize = 1 << sectorShift

    // Read _sectDirStart (offset 48-51, little-endian unsigned 32-bit)
    // This is the sector number of the first directory sector
    // Note: JS bitwise ops return signed 32-bit ints, so values >= 0x80000000
    // become negative. Use >>> 0 to reinterpret as unsigned for spec compliance.
    const sectDirStart =
      (headerBuffer[48] |
        (headerBuffer[49] << 8) |
        (headerBuffer[50] << 16) |
        (headerBuffer[51] << 24)) >>>
      0

    // Check for special values per MS-CFB spec:
    // ENDOFCHAIN (0xFFFFFFFE) = end of sector chain
    // FREESECT (0xFFFFFFFF) = unallocated sector
    if (sectDirStart >= 0xfffffffe) {
      return undefined
    }

    // Calculate CLSID location in file:
    // - the header occupies the whole first sector-sized slot (512 bytes, zero-padded
    //   to 4096 when sectors are 4096 bytes), so sector n starts at (n + 1) * sectorSize
    // - 80 bytes offset to CLSID within root directory entry
    const clsidOffset = (sectDirStart + 1) * sectorSize + 80
    const requiredLength = clsidOffset + 16

    // Read enough bytes to reach the CLSID
    const buffer = new Uint8Array(requiredLength)
    const bytesRead = await tokenizer.peekBuffer(buffer, {
      length: requiredLength,
      mayBeLess: true
    })

    // Verify we read enough bytes
    if (bytesRead < requiredLength) {
      return undefined
    }

    // Helper to check CLSID at the calculated offset
    const checkClsid = (clsid: number[]) =>
      clsid.every((value, i) => value === buffer[clsidOffset + i])

    // CLSID for .doc files: {00020906-0000-0000-C000-000000000046}
    const docClsid = [
      0x06, 0x09, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x46
    ]
    if (checkClsid(docClsid)) {
      return { ext: 'doc', mime: 'application/msword' }
    }

    // CLSIDs for .xls files (two variants):
    // {00020810-0000-0000-C000-000000000046}
    // {00020820-0000-0000-C000-000000000046}
    const xlsClsid1 = [
      0x10, 0x08, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x46
    ]
    const xlsClsid2 = [
      0x20, 0x08, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x46
    ]
    if (checkClsid(xlsClsid1) || checkClsid(xlsClsid2)) {
      return { ext: 'xls', mime: 'application/vnd.ms-excel' }
    }

    // CLSID for .ppt files: {64818D10-4F9B-11CF-86EA-00AA00B929E8}
    const pptClsid = [
      0x10, 0x8d, 0x81, 0x64, 0x9b, 0x4f, 0xcf, 0x11, 0x86, 0xea, 0x00, 0xaa, 0x00, 0xb9, 0x29,
      0xe8
    ]
    if (checkClsid(pptClsid)) {
      return { ext: 'ppt', mime: 'application/vnd.ms-powerpoint' }
    }

    return undefined
  }
}
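
Two details of the detector above are easy to get wrong, so here is a small sketch that makes them explicit: where a sector starts in the file, and how a CLSID string maps to the bytes stored on disk. `sectorOffset` and `clsidToBytes` are hypothetical helper names, not part of file-type. The sketch assumes the layout described in MS-CFB: the header fills the first sector-sized slot, and the first three CLSID fields are stored little-endian.

```typescript
// Byte offset at which sector `sectorNumber` starts. The header occupies the
// first sector-sized slot, so sector n begins at (n + 1) * sectorSize.
// For 512-byte sectors this equals 512 + n * 512.
function sectorOffset(sectorNumber: number, sectorShift: 9 | 12): number {
  // 2 ** shift instead of << keeps large sector numbers out of 32-bit integer math.
  return (sectorNumber + 1) * 2 ** sectorShift;
}

// Convert a CLSID such as "{00020906-0000-0000-C000-000000000046}" into the
// 16 bytes stored in the root directory entry. The first three fields
// (4, 2, and 2 bytes) are little-endian; the last 8 bytes are stored as written.
function clsidToBytes(clsid: string): number[] {
  const hex = clsid.replace(/[{}-]/g, '');
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) {
    throw new Error(`Not a valid CLSID: ${clsid}`);
  }
  const bytes = hex.match(/../g)!.map((pair) => parseInt(pair, 16));
  return [
    ...bytes.slice(0, 4).reverse(),
    ...bytes.slice(4, 6).reverse(),
    ...bytes.slice(6, 8).reverse(),
    ...bytes.slice(8),
  ];
}
```

With helpers like these, the root entry's CLSID would sit at `sectorOffset(sectDirStart, sectorShift) + 80`, and the hard-coded byte arrays could be generated from the CLSID strings quoted in the comments.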

@Borewit
Collaborator

Borewit commented Dec 20, 2025

@ajayvignesh01, if I open a repo to turn this code into a plugin (@file-type/compound), are you willing to contribute?

Borewit added a commit to Borewit/file-type-cfbf that referenced this pull request Dec 23, 2025
@Borewit
Collaborator

Borewit commented Dec 23, 2025

@ajayvignesh01, I created this repo: https://github.com/Borewit/file-type-cfbf

Would you mind adding your code there as a PR?

Could you create a commit like this one: Borewit/file-type-cfbf@75fbb7d

That way I can make sure you are properly credited as the original author.

Borewit added a commit to Borewit/file-type-cfbf that referenced this pull request Dec 24, 2025
@ajayvignesh01

Sure! On vacation, can get to it next week

@Borewit
Collaborator

Borewit commented Jan 3, 2026

A kind reminder @ajayvignesh01

@ajayvignesh01

Hey, thanks for the reminder. Will submit a PR later today!

@ajayvignesh01

@Borewit can you send me an invite to the repo again please - it expired

@Borewit
Collaborator

Borewit commented Jan 6, 2026

Done

@ajayvignesh01

Done, thanks!

Borewit/file-type-cfbf#5
