Doc xls ppt support by leizhao175 · Pull Request #223 · sindresorhus/file-type

leizhao175 · 2019-06-30T07:27:54Z

Hi,

I have extended your library to support the .doc, .xls, and .ppt formats (Office 97 to 2003 versions).

The logic is based on http://fileformats.archiveteam.org/wiki/Microsoft_Compound_File.

As you may know, identifying these file types basically requires loading the whole file into the chunk, so I had to make a small change to the interface. I tried my best to keep the impact close to zero.

Please let me know if you like it, or if I missed anything.

Here are my modifications:

fileType(chunk)

If the chunk is not an 'msi' file, the response is unchanged.
If the chunk is a doc, xls, or ppt file and is large enough to identify the type, it returns one of:
{ext: 'doc', mime: 'application/msword'}
{ext: 'xls', mime: 'application/vnd.ms-excel'}
{ext: 'ppt', mime: 'application/vnd.ms-powerpoint'}
If the chunk is a doc, xls, or ppt file but is not large enough to identify the type, it returns:
{ext: 'msi', mime: 'application/x-msi', minimumRequiredBytes: 1234567}
This is almost the same as the existing response, with an extra field, 'minimumRequiredBytes', that tells the user how large a chunk is required to identify the file. If they want, they can re-invoke the function with a larger chunk.
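
The re-invocation described above could be wrapped on the caller's side roughly as follows. This is a minimal sketch, not code from this PR: `detectWithRetry`, the `Detect` type, and the injected `detect` callback are hypothetical names standing in for the patched `fileType(chunk)`, and the 4100-byte default is an assumed initial chunk size.

```typescript
// Shape of the response described above; `minimumRequiredBytes` is only
// present when the chunk was too small to tell doc/xls/ppt apart.
type DetectResult =
  | { ext: string; mime: string; minimumRequiredBytes?: number }
  | undefined;

// Hypothetical stand-in for the patched fileType(chunk) function.
type Detect = (chunk: Uint8Array) => DetectResult;

function detectWithRetry(
  detect: Detect,
  file: Uint8Array,
  initialBytes = 4100, // assumed initial chunk size
): DetectResult {
  const first = detect(file.subarray(0, initialBytes));
  const needed = first?.minimumRequiredBytes;

  // No hint, or the hint asks for nothing beyond what was already
  // supplied: keep the first answer.
  if (needed === undefined || needed <= initialBytes) {
    return first;
  }

  // Re-invoke once with a chunk of the requested size. subarray() clamps
  // to the file length if the file is shorter than requested.
  return detect(file.subarray(0, needed));
}
```

Callers who do not care about doc, xls, or ppt can skip the wrapper and ignore the extra field.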

fileType.stream()

I have modified this function so that it automatically loads an extra chunk from the stream when required.
It still returns a PassThrough stream with the extra fileType field, as before.
If the stream is a .doc, .xls, or .ppt file, the function sets fileType to the correct type listed above.
There is no impact on users, as the function still accepts and returns a stream.
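
The "read again with a larger chunk" idea can be sketched over any async iterable of byte chunks (Node.js readable streams are async iterable). This is an illustration under assumptions, not the PR's implementation: `detectFromChunks` and the injected `detect` callback are hypothetical, the 4100-byte first read is assumed, and unlike the PR this sketch does not hand back a PassThrough stream.

```typescript
type DetectResult =
  | { ext: string; mime: string; minimumRequiredBytes?: number }
  | undefined;

async function detectFromChunks(
  detect: (chunk: Uint8Array) => DetectResult, // hypothetical stand-in for fileType(chunk)
  chunks: AsyncIterable<Uint8Array>,
  initialBytes = 4100, // assumed size of the first, small read
): Promise<DetectResult> {
  let buffered = new Uint8Array(0);
  let wanted = initialBytes;

  for await (const chunk of chunks) {
    // Append the new chunk to everything read so far.
    const merged = new Uint8Array(buffered.length + chunk.length);
    merged.set(buffered);
    merged.set(chunk, buffered.length);
    buffered = merged;

    if (buffered.length < wanted) continue; // not enough data yet, keep reading

    const result = detect(buffered);
    const needed = result?.minimumRequiredBytes;
    if (needed === undefined || needed <= buffered.length) {
      return result; // final answer, or a hint that cannot be acted on
    }
    wanted = needed; // read on until the larger chunk is available
  }

  // Stream ended early: best-effort detection on what was read.
  return detect(buffered);
}
```

Returning from inside `for await` ends the iteration early, so a real implementation that must pass the data through would need to re-emit the buffered bytes.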

testFile() & testFileFromStream()

I had to lightly modify these two functions in test.js so that they recognise the new 'minimumRequiredBytes' field and provide a larger chunk for testing.

Impact to users:

For users of the stream function, there is no impact.
For users of the chunk function who don't care about '.doc | .xls | .ppt', there is no impact either: they get the same result as before, plus an extra field, 'minimumRequiredBytes', which they can simply ignore.
Users of the chunk function who do care about '.doc | .xls | .ppt' must be new users, as this detection was not available before. And we can help them now :)

sindresorhus · 2019-06-30T07:43:01Z

While I appreciate the PR, I'm not really interested in supporting these ancient formats. Especially not when they cause lots of churn to the code base. I would recommend making a separate module that only checks for those formats and could be used in combination with file-type, or a fork of file-type.

ajayvignesh01 · 2025-12-18T05:42:55Z

For anyone stumbling upon this: I made a custom detector that you can use with this library to detect legacy doc, xls, and ppt files.

import { FileTypeParser, type Detector } from 'file-type'

// Usage: `legacyOfficeDetector` is defined below and must be declared before this line runs
const parser = new FileTypeParser({ customDetectors: [legacyOfficeDetector] })
const fileType = await parser.fromBuffer(...)

// Custom detector for legacy Microsoft Office files (.doc, .xls, .ppt)
// MS-CFB (Compound File Binary Format) structure:
// - Header: first 512 bytes (we only need first 52 bytes for key fields)
// - Directory stream: contains root entry with CLSID that identifies file type
// Ref: https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb
const legacyOfficeDetector: Detector = {
  id: 'legacy-office',
  async detect(tokenizer) {
    // CFBF signature: D0 CF 11 E0 A1 B1 1A E1
    const cfbfSignature = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]

    // Read minimum header bytes needed (52 bytes covers signature + sector shift + directory sector location)
    const headerBuffer = new Uint8Array(52)
    const headerBytesRead = await tokenizer.peekBuffer(headerBuffer, {
      length: 52,
      mayBeLess: true
    })

    // Need at least 52 bytes to read required header fields
    if (headerBytesRead < 52) {
      return undefined
    }

    // Check CFBF signature
    if (!cfbfSignature.every((value, index) => value === headerBuffer[index])) {
      return undefined
    }

    // Validate sector shift (offset 30): must be 9 (512-byte sectors) or 12 (4096-byte sectors)
    const sectorShift = headerBuffer[30]
    if (sectorShift !== 9 && sectorShift !== 12) {
      return undefined
    }
    const sectorSize = 1 << sectorShift

    // Read _sectDirStart (offset 48-51, little-endian unsigned 32-bit)
    // This is the sector number of the first directory sector
    // Note: JS bitwise ops return signed 32-bit ints, so values >= 0x80000000
    // become negative. Use >>> 0 to reinterpret as unsigned for spec compliance.
    const sectDirStart =
      (headerBuffer[48] |
        (headerBuffer[49] << 8) |
        (headerBuffer[50] << 16) |
        (headerBuffer[51] << 24)) >>>
      0

    // Check for special values per MS-CFB spec:
    // ENDOFCHAIN (0xFFFFFFFE) = end of sector chain
    // FREESECT (0xFFFFFFFF) = unallocated sector
    if (sectDirStart >= 0xfffffffe) {
      return undefined
    }

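    // Calculate CLSID location in file:
    // - sector #0 starts one full sector into the file: the header is 512 bytes
    //   and is padded to the sector size when sectors are 4096 bytes
    // - (sectDirStart + 1) * sectorSize therefore reaches the directory sector
    // - 80 bytes offset to CLSID within root directory entry
    const clsidOffset = (sectDirStart + 1) * sectorSize + 80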
    const requiredLength = clsidOffset + 16

    // Read enough bytes to reach the CLSID
    const buffer = new Uint8Array(requiredLength)
    const bytesRead = await tokenizer.peekBuffer(buffer, {
      length: requiredLength,
      mayBeLess: true
    })

    // Verify we read enough bytes
    if (bytesRead < requiredLength) {
      return undefined
    }

    // Helper to check CLSID at the calculated offset
    const checkClsid = (clsid: number[]) =>
      clsid.every((value, i) => value === buffer[clsidOffset + i])

    // CLSID for .doc files: {00020906-0000-0000-C000-000000000046}
    const docClsid = [
      0x06, 0x09, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x46
    ]
    if (checkClsid(docClsid)) {
      return { ext: 'doc', mime: 'application/msword' }
    }

    // CLSIDs for .xls files (two variants):
    // {00020810-0000-0000-C000-000000000046}
    // {00020820-0000-0000-C000-000000000046}
    const xlsClsid1 = [
      0x10, 0x08, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x46
    ]
    const xlsClsid2 = [
      0x20, 0x08, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
      0x46
    ]
    if (checkClsid(xlsClsid1) || checkClsid(xlsClsid2)) {
      return { ext: 'xls', mime: 'application/vnd.ms-excel' }
    }

    // CLSID for .ppt files: {64818D10-4F9B-11CF-86EA-00AA00B929E8}
    const pptClsid = [
      0x10, 0x8d, 0x81, 0x64, 0x9b, 0x4f, 0xcf, 0x11, 0x86, 0xea, 0x00, 0xaa, 0x00, 0xb9, 0x29,
      0xe8
    ]
    if (checkClsid(pptClsid)) {
      return { ext: 'ppt', mime: 'application/vnd.ms-powerpoint' }
    }

    return undefined
  }
}
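
The byte arrays in the detector follow the usual on-disk GUID layout: the first three fields are little-endian and the last eight bytes are stored as written. As a hedged aid for checking or extending the CLSID list, a small helper can derive the bytes from the braced string form. `guidToBytes` is a hypothetical name and is not part of file-type.

```typescript
// Convert a GUID string such as "{00020906-0000-0000-C000-000000000046}"
// into the 16 bytes as they are stored in the directory entry.
function guidToBytes(guid: string): number[] {
  const hex = guid.replace(/[{}-]/g, '');
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) {
    throw new Error(`Invalid GUID: ${guid}`);
  }

  // Split the 32 hex digits into 16 bytes, in the order they are written.
  const bytes: number[] = [];
  for (let i = 0; i < 32; i += 2) {
    bytes.push(parseInt(hex.slice(i, i + 2), 16));
  }

  // Data1 (4 bytes), Data2 (2 bytes) and Data3 (2 bytes) are little-endian,
  // so each is reversed; the final 8 bytes (Data4) keep their written order.
  return [
    ...bytes.slice(0, 4).reverse(),
    ...bytes.slice(4, 6).reverse(),
    ...bytes.slice(6, 8).reverse(),
    ...bytes.slice(8),
  ];
}
```

For the .doc CLSID above, the result starts with 0x06, 0x09, 0x02, 0x00 and ends with 0x46, matching `docClsid`.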

Borewit · 2025-12-20T18:52:47Z

@ajayvignesh01, if I open a repo to turn this code into a plugin (@file-type/compound), would you be willing to contribute?

Borewit · 2025-12-23T19:17:18Z

@ajayvignesh01, I created this repo: https://github.com/Borewit/file-type-cfbf

Would you mind adding your code there as a PR? If you can create a commit like Borewit/file-type-cfbf@75fbb7d, I can make sure you are properly credited as the original author.

ajayvignesh01 · 2025-12-24T21:54:53Z

Sure! On vacation, can get to it next week

Borewit · 2026-01-03T15:18:43Z

A kind reminder @ajayvignesh01

ajayvignesh01 · 2026-01-05T14:57:15Z

Hey, thanks for the reminder. Will submit a PR later today!

ajayvignesh01 · 2026-01-06T03:05:48Z

@Borewit can you send me an invite to the repo again please - it expired

Borewit · 2026-01-06T08:17:27Z

Done

ajayvignesh01 · 2026-01-06T17:00:18Z

Done, thanks!

Borewit/file-type-cfbf#5

Zhao, Lei (AU - Sydney) and others added 2 commits June 30, 2019 12:37

36e5c94
1, Add support for .doc, .xls, .ppt
2, Modify `fileType.stream()`, so it can read from the stream twice. First time it will try to read a small chunk (minimumBytes) for fileType(). But if it's a doc / xls / ppt file, it will read it again with a larger chunk for fileType()

5043585
Convert index.js fileType.stream() to async function as the test standard does not support promise().then()
Add unit test cases

sindresorhus closed this Jun 30, 2019

Borewit added a commit to Borewit/file-type-cfbf that referenced this pull request Dec 23, 2025

Add detection code based on sindresorhus/file-type#223 (comment)

e8fb7af

Borewit added a commit to Borewit/file-type-cfbf that referenced this pull request Dec 24, 2025

Add detection code based on sindresorhus/file-type#223 (comment)

75fbb7d

Borewit mentioned this pull request Jan 8, 2026

List @file-type/cfbf plugin #791

Merged

leizhao175 wants to merge 2 commits into sindresorhus:master from leizhao175:doc-xls-ppt-support
