
feat(parse): add support for legacy .doc and .xls file formats #652

Merged
qin-ctx merged 3 commits into volcengine:main from ngoclam9415:feat/legacy-doc-xls-support on Mar 16, 2026

ngoclam9415 commented Mar 16, 2026

Summary

  • Add LegacyDocParser for Word 97-2003 .doc files using olefile (OLE2 binary format parsing with piece table support and multi-level fallbacks)
  • Extend ExcelParser to handle legacy .xls files using xlrd, while keeping openpyxl for .xlsx/.xlsm
  • New dependencies: olefile>=0.47, xlrd>=2.0.1

Motivation

Directories containing legacy .doc and .xls files currently fail with UnsupportedDirectoryFilesError, or with the openpyxl error "openpyxl does not support the old .xls file format". These formats are still common in industrial and enterprise environments.

Changes

  • openviking/parse/parsers/legacy_doc.py — New parser for .doc files
  • openviking/parse/parsers/excel.py — Added _convert_xls_to_markdown() method using xlrd
  • openviking/parse/registry.py — Registered LegacyDocParser for .doc extension
  • pyproject.toml — Added olefile and xlrd dependencies
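
The extension-based branching described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `pick_excel_backend` is hypothetical, and the sketch only returns the name of the library that would handle each extension.

```python
from pathlib import Path

# Extensions handled by each library, as stated in the PR summary.
XLRD_EXTENSIONS = {".xls"}
OPENPYXL_EXTENSIONS = {".xlsx", ".xlsm"}


def pick_excel_backend(path: str) -> str:
    """Return the library name for a spreadsheet path (hypothetical helper)."""
    suffix = Path(path).suffix.lower()  # case-insensitive: "Report.XLS" -> ".xls"
    if suffix in XLRD_EXTENSIONS:
        return "xlrd"
    if suffix in OPENPYXL_EXTENSIONS:
        return "openpyxl"
    raise ValueError(f"unsupported spreadsheet extension: {suffix!r}")
```

Lower-casing the suffix first means files such as `REPORT.XLS` take the same branch as `report.xls`.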

Test plan

  • Verify .doc files parse correctly (Word 97-2003 binary format)
  • Verify .xls files parse correctly (Excel 97-2003 binary format)
  • Verify .docx and .xlsx parsing unchanged
  • Existing test suite passes (22 passed)
  • Directory scan no longer raises UnsupportedDirectoryFilesError for .doc/.xls files

Add LegacyDocParser using olefile to extract text from Word 97-2003
binary .doc files via OLE2 stream parsing with piece table support
and multi-level fallbacks.

Extend ExcelParser to handle .xls files using xlrd, branching the
parse logic based on file extension while keeping openpyxl for
.xlsx/.xlsm.

New dependencies: olefile>=0.47, xlrd>=2.0.1

CLAassistant commented Mar 16, 2026

CLA assistant check
All committers have signed the CLA.

row_data = []
for col_idx in range(sheet.ncols):
    cell = sheet.cell(row_idx, col_idx)
    row_data.append(str(cell.value) if cell.value is not None else "")
Collaborator

[Bug] xlrd returns raw float serial numbers for date cells (e.g. 44927.0 instead of 2023-01-01). cell.ctype should be checked to handle dates (and booleans) properly.

Suggested fix:

for col_idx in range(sheet.ncols):
    cell = sheet.cell(row_idx, col_idx)
    if cell.ctype == xlrd.XL_CELL_DATE:
        try:
            dt = xlrd.xldate_as_tuple(cell.value, wb.datemode)
            row_data.append(f"{dt[0]:04d}-{dt[1]:02d}-{dt[2]:02d}")
        except Exception:
            row_data.append(str(cell.value))
    elif cell.ctype == xlrd.XL_CELL_BOOLEAN:
        row_data.append("TRUE" if cell.value else "FALSE")
    elif cell.value is not None:
        row_data.append(str(cell.value))
    else:
        row_data.append("")

Contributor Author

Fixed in dbf4d4c with:

  • formatting_info=True on xlrd.open_workbook() so date cells are properly detected via XL_CELL_DATE
  • Added handling for all cell types: XL_CELL_ERROR (mapped to #DIV/0!, #N/A, etc.), XL_CELL_BOOLEAN, XL_CELL_BLANK/EMPTY
  • Date formatting now includes time component when non-zero
  • Integers display without trailing .0
  • Added on_demand=True + release_resources() for memory efficiency
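
The per-type formatting listed above might look roughly like the sketch below. It is not the `_format_xls_cell` method from the PR. To stay runnable without xlrd installed, it redefines the cell type codes locally (using the same numeric values as xlrd's `XL_CELL_*` constants) and converts serial dates with the standard library. The names `format_xls_cell`, `serial_to_datetime`, and `ERROR_TEXT`, and the `#ERR(...)` fallback text, are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Local copies of the cell type codes; values match xlrd's XL_CELL_* constants.
XL_CELL_EMPTY, XL_CELL_TEXT, XL_CELL_NUMBER, XL_CELL_DATE = 0, 1, 2, 3
XL_CELL_BOOLEAN, XL_CELL_ERROR, XL_CELL_BLANK = 4, 5, 6

# BIFF error codes and the text Excel displays for them.
ERROR_TEXT = {
    0x00: "#NULL!", 0x07: "#DIV/0!", 0x0F: "#VALUE!", 0x17: "#REF!",
    0x1D: "#NAME?", 0x24: "#NUM!", 0x2A: "#N/A",
}


def serial_to_datetime(serial: float, datemode: int = 0) -> datetime:
    """Convert an Excel serial date (datemode 0 = 1900 system, 1 = 1904 system)."""
    # Assumption: serials below 61 in the 1900 system are not special-cased here;
    # the 1899-12-30 epoch absorbs Excel's fictitious 1900-02-29.
    epoch = datetime(1899, 12, 30) if datemode == 0 else datetime(1904, 1, 1)
    return epoch + timedelta(seconds=round(serial * 86400))


def format_xls_cell(ctype: int, value, datemode: int = 0) -> str:
    """Render one cell as text (hypothetical helper)."""
    if ctype in (XL_CELL_EMPTY, XL_CELL_BLANK):
        return ""
    if ctype == XL_CELL_DATE:
        try:
            dt = serial_to_datetime(value, datemode)
        except (OverflowError, ValueError):
            return str(value)  # fall back to the raw serial number
        if (dt.hour, dt.minute, dt.second) == (0, 0, 0):
            return dt.strftime("%Y-%m-%d")
        return dt.strftime("%Y-%m-%d %H:%M:%S")  # keep a non-zero time component
    if ctype == XL_CELL_BOOLEAN:
        return "TRUE" if value else "FALSE"
    if ctype == XL_CELL_ERROR:
        return ERROR_TEXT.get(value, f"#ERR({value})")
    if ctype == XL_CELL_NUMBER and float(value).is_integer():
        return str(int(value))  # 3.0 -> "3"
    return str(value)
```

With these definitions, `format_xls_cell(XL_CELL_DATE, 44927.0)` gives `2023-01-01`, the example from the review comment.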

Also hardened the .doc parser with stream size caps, FIB version check, and bounds validation.

Check cell.ctype for XL_CELL_DATE and XL_CELL_BOOLEAN to avoid
outputting raw float serial numbers for dates and numeric 0/1 for
booleans.

excel.py:
- Enable formatting_info=True so xlrd detects date cells properly
- Add on_demand=True and release_resources() for memory efficiency
- Handle all xlrd cell types: DATE (with time), BOOLEAN, ERROR, BLANK, EMPTY
- Display integers without trailing .0
- Extract cell formatting to _format_xls_cell static method

legacy_doc.py:
- Add 50MB stream size cap to prevent DoS from crafted files
- Cap ccpText at 10M chars to prevent memory exhaustion
- Add FIB version check (require Word 97+ / nFib >= 0x00C1)
- Add minimum buffer length check before struct.unpack_from
- Fix Grpprl skip loop to prevent spin on zero-length entries
- Add _clean_word_text for \x0B (soft break) and \x0C (section break)
- Log warnings for pieces extending beyond stream bounds
- Cap fallback extract to max stream size
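
A few of the checks in this list can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the code in legacy_doc.py: the function names `read_nfib` and `clean_word_text` are hypothetical, the header layout (a 16-bit `wIdent` of 0xA5EC followed by a 16-bit `nFib`, both little-endian) follows the published Word binary format, and mapping \x0B to a newline and \x0C to a blank line is an assumed choice.

```python
import struct

WORD_MAGIC = 0xA5EC                   # wIdent value of a Word binary file
MIN_NFIB = 0x00C1                     # Word 97, the minimum named in the commit
MAX_STREAM_BYTES = 50 * 1024 * 1024   # 50MB cap from the commit message


def read_nfib(word_document: bytes) -> int:
    """Validate the start of the WordDocument stream and return nFib."""
    if len(word_document) > MAX_STREAM_BYTES:
        raise ValueError("WordDocument stream exceeds the size cap")
    # Check the buffer length before struct.unpack_from reads from it.
    if len(word_document) < 4:
        raise ValueError("stream too short to contain a FIB header")
    w_ident, n_fib = struct.unpack_from("<HH", word_document, 0)
    if w_ident != WORD_MAGIC:
        raise ValueError(f"unexpected wIdent: {w_ident:#06x}")
    if n_fib < MIN_NFIB:
        raise ValueError(f"pre-Word 97 file (nFib={n_fib:#06x})")
    return n_fib


def clean_word_text(text: str) -> str:
    """Replace Word control characters with plain-text breaks (assumed mapping)."""
    return text.replace("\x0b", "\n").replace("\x0c", "\n\n")
```

Raising `ValueError` for every failure keeps the sketch short; a real parser could instead fall through to one of its fallback extraction paths.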
qin-ctx merged commit 559eef3 into volcengine:main on Mar 16, 2026
1 check passed
github-project-automation bot moved this from Backlog to Done in OpenViking project on Mar 16, 2026