feat(parse): add support for legacy .doc and .xls file formats #652
Merged
qin-ctx merged 3 commits into volcengine:main on Mar 16, 2026
Conversation
Add LegacyDocParser using olefile to extract text from Word 97-2003 binary .doc files via OLE2 stream parsing with piece table support and multi-level fallbacks. Extend ExcelParser to handle .xls files using xlrd, branching the parse logic based on file extension while keeping openpyxl for .xlsx/.xlsm. New dependencies: olefile>=0.47, xlrd>=2.0.1
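The extension-based branching described above could be sketched roughly as follows. This is only an illustration: `BACKEND_BY_SUFFIX` and `pick_excel_backend` are hypothetical names, not taken from the PR, and the real `ExcelParser` may structure the branch differently.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the library named in this PR.
BACKEND_BY_SUFFIX = {
    ".xls": "xlrd",       # legacy Excel 97-2003 binary format
    ".xlsx": "openpyxl",  # Office Open XML workbook
    ".xlsm": "openpyxl",  # macro-enabled Office Open XML workbook
}


def pick_excel_backend(path: str) -> str:
    """Return the backend name for a spreadsheet path (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return BACKEND_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported spreadsheet extension: {suffix!r}") from None
```

Keeping the mapping in one table means a new format only needs a new entry, while unknown extensions fail loudly instead of being handed to the wrong library.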
qin-ctx
reviewed
Mar 16, 2026
openviking/parse/parsers/excel.py
Outdated
    row_data = []
    for col_idx in range(sheet.ncols):
        cell = sheet.cell(row_idx, col_idx)
        row_data.append(str(cell.value) if cell.value is not None else "")
Collaborator
[Bug] xlrd returns raw float serial numbers for date cells (e.g. 44927.0 instead of 2023-01-01). cell.ctype should be checked to handle dates (and booleans) properly.
Suggested fix:
    for col_idx in range(sheet.ncols):
        cell = sheet.cell(row_idx, col_idx)
        if cell.ctype == xlrd.XL_CELL_DATE:
            try:
                dt = xlrd.xldate_as_tuple(cell.value, wb.datemode)
                row_data.append(f"{dt[0]:04d}-{dt[1]:02d}-{dt[2]:02d}")
            except Exception:
                row_data.append(str(cell.value))
        elif cell.ctype == xlrd.XL_CELL_BOOLEAN:
            row_data.append("TRUE" if cell.value else "FALSE")
        elif cell.value is not None:
            row_data.append(str(cell.value))
        else:
            row_data.append("")
Contributor
Author
Fixed in dbf4d4c with:
- formatting_info=True on xlrd.open_workbook() so date cells are properly detected via XL_CELL_DATE
- Added handling for all cell types: XL_CELL_ERROR (mapped to #DIV/0!, #N/A, etc.), XL_CELL_BOOLEAN, XL_CELL_BLANK/EMPTY
- Date formatting now includes time component when non-zero
- Integers display without trailing .0
- Added on_demand=True + release_resources() for memory efficiency

Also hardened the .doc parser with stream size caps, FIB version check, and bounds validation.
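For readers following the thread, the per-type formatting listed above might look something like the sketch below. It is not the code from dbf4d4c: `format_cell` and `serial_to_datetime` are hypothetical names, the sketch is stdlib-only so it runs without xlrd installed, and the constants simply mirror xlrd's `XL_CELL_*` values and BIFF error codes. The date conversion also ignores the 1900 leap-year quirk for serials below 61, which xlrd's own helpers handle.

```python
from datetime import datetime, timedelta

# Cell-type codes, mirroring xlrd's XL_CELL_* constants (0 through 6).
(XL_CELL_EMPTY, XL_CELL_TEXT, XL_CELL_NUMBER, XL_CELL_DATE,
 XL_CELL_BOOLEAN, XL_CELL_ERROR, XL_CELL_BLANK) = range(7)

# BIFF error codes and their display text.
ERROR_TEXT = {
    0x00: "#NULL!", 0x07: "#DIV/0!", 0x0F: "#VALUE!", 0x17: "#REF!",
    0x1D: "#NAME?", 0x24: "#NUM!", 0x2A: "#N/A",
}


def serial_to_datetime(serial: float, datemode: int = 0) -> datetime:
    """Convert an Excel serial date; datemode 0 = 1900 system, 1 = 1904 system."""
    epoch = datetime(1904, 1, 1) if datemode else datetime(1899, 12, 30)
    return epoch + timedelta(days=serial)


def format_cell(ctype: int, value, datemode: int = 0) -> str:
    """Hypothetical formatter: turn one (ctype, value) pair into display text."""
    if ctype in (XL_CELL_EMPTY, XL_CELL_BLANK):
        return ""
    if ctype == XL_CELL_DATE:
        dt = serial_to_datetime(value, datemode)
        # Only show the time component when it is non-zero.
        if dt.hour == dt.minute == dt.second == 0:
            return dt.strftime("%Y-%m-%d")
        return dt.strftime("%Y-%m-%d %H:%M:%S")
    if ctype == XL_CELL_BOOLEAN:
        return "TRUE" if value else "FALSE"
    if ctype == XL_CELL_ERROR:
        return ERROR_TEXT.get(value, f"#ERR{value}")
    if ctype == XL_CELL_NUMBER and float(value).is_integer():
        return str(int(value))  # 42.0 is shown as 42
    return str(value)
```

With this sketch, the reviewer's example serial `44927.0` under the 1900 date system formats as `2023-01-01`.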
Check cell.ctype for XL_CELL_DATE and XL_CELL_BOOLEAN to avoid outputting raw float serial numbers for dates and numeric 0/1 for booleans.
excel.py:
- Enable formatting_info=True so xlrd detects date cells properly
- Add on_demand=True and release_resources() for memory efficiency
- Handle all xlrd cell types: DATE (with time), BOOLEAN, ERROR, BLANK, EMPTY
- Display integers without trailing .0
- Extract cell formatting to _format_xls_cell static method

legacy_doc.py:
- Add 50MB stream size cap to prevent DoS from crafted files
- Cap ccpText at 10M chars to prevent memory exhaustion
- Add FIB version check (require Word 97+ / nFib >= 0x00C1)
- Add minimum buffer length check before struct.unpack_from
- Fix Grpprl skip loop to prevent spin on zero-length entries
- Add _clean_word_text for \x0B (soft break) and \x0C (section break)
- Log warnings for pieces extending beyond stream bounds
- Cap fallback extract to max stream size
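A rough sketch of the header checks listed for legacy_doc.py, assuming the WordDocument stream has already been read into a `bytes` object. `check_fib_header` is a hypothetical name, and reading "50MB" as 50 * 1024 * 1024 bytes is an assumption; the merged code may differ. The layout used here (a 16-bit `wIdent` magic of 0xA5EC at offset 0, followed by a 16-bit `nFib` at offset 2) comes from the Word binary file format.

```python
import struct

WORD_MAGIC = 0xA5EC                  # wIdent value identifying a Word binary file
MIN_NFIB_WORD97 = 0x00C1             # lowest nFib accepted (Word 97)
MAX_STREAM_BYTES = 50 * 1024 * 1024  # assumed interpretation of the 50MB cap


def check_fib_header(word_stream: bytes) -> int:
    """Validate the start of a WordDocument stream and return its nFib."""
    if len(word_stream) > MAX_STREAM_BYTES:
        raise ValueError("WordDocument stream exceeds size cap")
    # Length check first, so struct.unpack_from cannot read past the buffer.
    if len(word_stream) < 4:
        raise ValueError("WordDocument stream too short for a FIB header")
    w_ident, n_fib = struct.unpack_from("<HH", word_stream, 0)
    if w_ident != WORD_MAGIC:
        raise ValueError(f"unexpected wIdent 0x{w_ident:04X}")
    if n_fib < MIN_NFIB_WORD97:
        raise ValueError(f"nFib 0x{n_fib:04X} is older than Word 97")
    return n_fib
```

Rejecting bad input before any offset-based parsing keeps later piece-table reads from operating on a stream that was never a Word 97+ document in the first place.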
qin-ctx
approved these changes
Mar 16, 2026
Summary
- Add LegacyDocParser for Word 97-2003 .doc files using olefile (OLE2 binary format parsing with piece table support and multi-level fallbacks)
- Extend ExcelParser to handle legacy .xls files using xlrd, while keeping openpyxl for .xlsx/.xlsm
- New dependencies: olefile>=0.47, xlrd>=2.0.1

Motivation
Directories containing legacy .doc and .xls files currently fail with UnsupportedDirectoryFilesError or "openpyxl does not support the old .xls file format" errors. These formats are still common in industrial/enterprise environments.

Changes
- openviking/parse/parsers/legacy_doc.py: New parser for .doc files
- openviking/parse/parsers/excel.py: Added _convert_xls_to_markdown() method using xlrd
- openviking/parse/registry.py: Registered LegacyDocParser for .doc extension
- pyproject.toml: Added olefile and xlrd dependencies

Test plan
- .doc files parse correctly (Word 97-2003 binary format)
- .xls files parse correctly (Excel 97-2003 binary format)
- .docx and .xlsx parsing unchanged
- Tests: 22 passed
- No UnsupportedDirectoryFilesError for .doc/.xls files