sharepoint-to-text is a typed, pure-Python library for extracting text and structured content from file types commonly found in SharePoint and document-management workflows.
It is built for software engineers who need one extraction interface across modern Microsoft Office files, legacy Office files, OpenDocument, PDF, email, HTML-like content, plain-text formats, and archives.
Document ingestion pipelines usually fail in one of two ways:
- they only support a narrow set of office formats
- they require heavyweight external runtimes such as LibreOffice, Java, or platform-specific tooling
sharepoint-to-text takes a different approach:
- Pure Python library API and CLI
- One routing layer for many file types
- Works with file paths and in-memory bytes
- Typed extraction objects with metadata, units, images, and tables
- Suitable for indexing, RAG, ETL, compliance review, and migration tooling
| Item | Details |
|---|---|
| Python | `>=3.10` |
| Install | `uv add sharepoint-to-text` |
| Runtime model | Pure Python |
| Primary interfaces | `read_file(...)`, `read_bytes(...)`, `read_many(...)`, CLI |
| Output model | Generator of typed extraction objects |
| SharePoint access | Optional `sharepoint_io` helper for Graph-backed listing/download |
This package is a good fit if you need to:
- normalize text extraction across many enterprise document formats
- process documents from disk, APIs, queues, or object storage
- preserve some document structure for downstream chunking or citations
- run extraction in Python-only environments such as services, workers, or serverless jobs
It is not a full document rendering engine, OCR system, or layout-preserving conversion tool.
```
uv add sharepoint-to-text
```

With pip:

```
pip install sharepoint-to-text
```

From source:

```
git clone https://github.com/Horsmann/sharepoint-to-text.git
cd sharepoint-to-text
uv sync --all-groups
```

Read a single file:

```python
import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(result.get_full_text())
```

Read from in-memory bytes:

```python
import sharepoint2text

payload = b"hello from memory"
result = next(sharepoint2text.read_bytes(payload, extension="txt"))
print(result.get_full_text())
```

Iterate structural units:

```python
import sharepoint2text

result = next(sharepoint2text.read_file("report.pdf", ignore_images=True))
for unit in result.iterate_units():
    print(unit.get_text())
    print(unit.get_metadata())
```

Extract from a folder:

```python
import sharepoint2text

# Extract only Word and PDF files from a folder
for result in sharepoint2text.read_many("docs", suffixes=[".docx", ".pdf"]):
    print(result.get_metadata().file_path)
    print(result.get_full_text()[:200])
```

```python
import sharepoint2text

# Extract all supported file formats recursively
for result in sharepoint2text.read_many("docs", extract_all_supported=True):
    for unit in result.iterate_units(ignore_images=True):
        text = unit.get_text().strip()
        if text:
            print(result.get_metadata().file_path, text[:120])
```

Full signatures:

```python
sharepoint2text.read_file(
    path,
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
)

sharepoint2text.read_bytes(
    data,
    extension="pdf",              # or ".pdf"
    mime_type=None,               # for example "application/pdf"
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
)

sharepoint2text.read_many(
    folder_path,
    suffixes=[".docx", ".pdf"],   # list of extensions to extract
    extract_all_supported=False,  # or True to extract all supported formats
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
    recursive=True,               # traverse subdirectories
)

sharepoint2text.is_supported_file(path)
sharepoint2text.get_extractor(path)
```

The `read_many` function extracts content from multiple files in a folder:
| Parameter | Description |
|---|---|
| `folder_path` | Path to the folder to traverse |
| `suffixes` | List of file extensions to extract (e.g., `[".docx", ".pdf"]`) |
| `extract_all_supported` | If `True`, extract all supported formats (mutually exclusive with `suffixes`) |
| `recursive` | If `True` (default), traverse subdirectories |
Configuration rules:
- You must specify either `suffixes` or `extract_all_supported=True`
- Specifying both raises `InvalidConfigurationError`
- Suffixes are normalized (with or without leading dot)
- Extraction continues on errors, logging warnings for failed files
All extracted results implement a common interface:
- `get_full_text()`
- `iterate_units()`
- `iterate_images()`
- `iterate_tables()`
- `get_metadata()`
- `to_json()` / `from_json(...)`
Use `get_full_text()` when you want one string per extraction result.
Use `iterate_units()` when you want coarse structural chunks such as:
- one page per PDF unit
- one slide per presentation unit
- one sheet per spreadsheet unit
- one document-level unit for most text-document formats
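Units map naturally onto citation-friendly chunks for downstream indexing. As a sketch, the `Unit` class and `to_chunks` helper below are illustrative stand-ins, not part of the library:

```python
from dataclasses import dataclass

# Illustrative stand-ins: `Unit` mimics a per-page extraction unit, and
# `to_chunks` shows how unit-level structure can carry provenance
# (source file + page) into downstream chunking or citations.
@dataclass
class Unit:
    text: str
    page: int

def to_chunks(units, source):
    # One chunk per unit keeps page-level provenance for citations.
    return [{"source": source, "page": u.page, "text": u.text} for u in units]

chunks = to_chunks([Unit("intro", 1), Unit("details", 2)], "report.pdf")
```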
The API returns generators because some inputs can produce multiple results:
- archives can yield one result per supported member file
- `.mbox` can yield one result per email
- email extraction can recursively expose supported attachments
For single-document formats, `next(...)` is usually the simplest call pattern.
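This is ordinary generator consumption; the `extract` generator below is a stand-in used to illustrate the pattern, not the library's API:

```python
def extract(path):
    # Stand-in generator: a real extractor may yield one result per
    # archive member or per .mbox message, but single-document formats
    # yield exactly one result.
    yield f"text of {path}"

# Single-document formats: take the first (and only) result.
single = next(extract("document.docx"))

# Potentially multi-result inputs: iterate everything.
all_results = list(extract("archive.zip"))
```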
The package installs a `sharepoint2text` command.
Plain text output:
```
sharepoint2text --file /path/to/file.docx
```

Full extraction objects as JSON:

```
sharepoint2text --file /path/to/file.docx --json
```

Per-unit JSON:

```
sharepoint2text --file /path/to/file.pdf --json-unit
```

Extract all supported files from a folder:

```
sharepoint2text --folder /path/to/folder
```

Extract only specific file types:

```
sharepoint2text --folder /path/to/folder --suffixes .docx,.pdf,.txt
```

Non-recursive (top-level only):

```
sharepoint2text --folder /path/to/folder --no-recursive
```

When extracting from a folder, output to another folder to preserve the directory structure:

```
# Write each file separately to the output folder
sharepoint2text --folder /input/docs --output /output/extracted/

# The output structure mirrors the input:
# /input/docs/report.docx -> /output/extracted/report.txt
# /input/docs/sub/data.xlsx -> /output/extracted/sub/data.txt
```

Output path behavior:
- If `--output` is an existing directory, files are written separately
- If `--output` is a new path without an extension, it is created as a directory
- If `--output` has a file extension, all results are combined into that file
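These rules can be sketched with `pathlib`; `output_mode` is a hypothetical helper mirroring the described behavior, not the CLI's implementation:

```python
from pathlib import Path

def output_mode(output):
    # Hypothetical sketch of the output-path rules described above.
    p = Path(output)
    if p.is_dir():
        return "directory"       # existing directory: write files separately
    if p.suffix:
        return "combined-file"   # has an extension: combine into one file
    return "new-directory"       # no extension: created as a directory
```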
| Option | Description |
|---|---|
| `--file FILE, -f FILE` | Path to a single file to extract |
| `--folder FOLDER, -d FOLDER` | Path to a folder to extract files from (recursive by default) |
| `--suffixes SUFFIXES, -s SUFFIXES` | Comma-separated file suffixes to filter (e.g., `.docx,.pdf`). Only with `--folder`. If omitted, extracts all supported types. |
| `--no-recursive` | Only extract top-level files (no subdirectories). Only with `--folder`. |
| `--output PATH, -o PATH` | Output path: file (combined) or folder (separate files mirroring input structure) |
| `--json, -j` | Emit `list[extraction_object]` |
| `--json-unit, -u` | Emit `list[unit_object]` |
| `--include-images, -i` | Include base64 image payloads in JSON output |
| `--no-attachments, -n` | Skip expanding supported email attachments |
| `--max-file-size-mb, -m` | Maximum input size in MiB; default 100; use 0 to disable |
| `--version, -v` | Print CLI version |
Important CLI rules:

- `--file` and `--folder` are mutually exclusive (one is required)
- `--suffixes` and `--no-recursive` only work with `--folder`
- `--json` and `--json-unit` are mutually exclusive
- `--include-images` requires `--json` or `--json-unit`
- the CLI enforces the same file-size guard as the Python API
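The size guard compares the input against a MiB-based limit; a minimal sketch of the equivalent check (`exceeds_limit` is an assumption for illustration, not the CLI's code):

```python
def exceeds_limit(size_bytes, max_mb=100):
    # max_mb == 0 disables the guard, matching the documented behavior;
    # otherwise the limit is max_mb mebibytes (MiB).
    if max_mb == 0:
        return False
    return size_bytes > max_mb * 1024 * 1024
```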
Microsoft Office:

- Modern: `.docx`, `.docm`, `.xlsx`, `.xlsm`, `.xlsb`, `.pptx`, `.pptm`
- Legacy: `.doc`, `.dot`, `.xls`, `.xlt`, `.ppt`, `.pot`, `.pps`, `.rtf`
- Alias mapping: `.dotx`, `.dotm`, `.xltx`, `.xltm`, `.potx`, `.potm`, `.ppsx`, `.ppsm`

OpenDocument:

- `.odt`, `.ods`, `.odp`, `.odg`, `.odf`
- Alias mapping: `.ott`, `.ots`, `.otp`

Email:

- `.eml`, `.msg`, `.mbox`
- `.eml` and `.msg` can parse and expose supported attachments
- `.mbox` yields one result per message

Plain text:

- `.txt`, `.md`, `.csv`, `.tsv`, `.json`, `.yaml`, `.yml`, `.xml`, `.log`, `.ini`, `.cfg`, `.conf`, `.properties`

Web and e-book:

- `.html`, `.htm`, `.mhtml`, `.mht`, `.epub`

PDF:

- `.pdf`

Archives:

- `.zip`, `.tar`, `.7z`, `.tar.gz`, `.tgz`, `.tar.bz2`, `.tbz2`, `.tar.xz`, `.txz`
For a behavior-focused view of units, attachments, and caveats by format, see doc/format-matrix.md.
The extraction library works independently of SharePoint. The optional `sharepoint_io` module is a separate helper layer for listing and downloading files through Microsoft Graph before extraction.
```python
import io

import sharepoint2text
from sharepoint2text.sharepoint_io import (
    EntraIDAppCredentials,
    SharePointRestClient,
)

credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)

client = SharePointRestClient(
    site_url="https://contoso.sharepoint.com/sites/Documents",
    credentials=credentials,
)

for file_meta in client.list_all_files():
    data = client.download_file(file_meta.id)
    extractor = sharepoint2text.get_extractor(file_meta.name)
    for result in extractor(io.BytesIO(data), path=file_meta.name):
        print(result.get_full_text()[:200])
```

Setup details live in sharepoint2text/sharepoint_io/SETUP.md.
These are the points an engineering team usually needs before adopting the package:
- No OCR: scanned-image PDFs will often produce little or no text
- No external office renderer: output is extraction-oriented, not fidelity-oriented
- Word-like formats do not expose reliable page boundaries
- Nested archives are intentionally skipped
- Password-protected or encrypted inputs raise extraction errors
- Large files and highly compressed archives are guarded by size limits and zip-bomb protections
Archive handling:

- archives are processed one level deep
- supported non-archive files inside an archive can yield extraction results
- nested archives are skipped as a safety measure
- 7z extraction is capped at 100 MB internally
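A common way to guard against zip bombs is to check declared uncompressed sizes before inflating anything; this stdlib sketch illustrates the idea and is not the library's actual guard:

```python
import io
import zipfile

def declared_uncompressed_size(data):
    # Zip central-directory entries declare each member's uncompressed
    # size, so an oversized archive can be rejected before any bytes
    # are actually inflated.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return sum(info.file_size for info in zf.infolist())

# Example: build a tiny archive in memory and inspect it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
size = declared_uncompressed_size(buf.getvalue())
```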
Performance tips:

- set `ignore_images=True` when image payloads are not needed
- use `iterate_units()` for chunk-wise downstream processing instead of materializing one large string when structure matters
- keep size limits enabled unless you trust the input source
Common exceptions:
- `ExtractionFileFormatNotSupportedError`
- `ExtractionFileEncryptedError`
- `ExtractionFileTooLargeError`
- `ExtractionLegacyMicrosoftParsingError`
- `ExtractionZipBombError`
- `ExtractionPathTraversalError`
- `ExtractionFailedError`
- `InvalidConfigurationError` (for `read_many` with conflicting options)
If you are integrating this into a service, see doc/integration-guide.md and doc/troubleshooting.md.
```python
import json

import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
payload = result.to_json()
print(json.dumps(payload))
```

Restore from JSON:

```python
from sharepoint2text.parsing.extractors.data_types import ExtractionInterface

restored = ExtractionInterface.from_json(payload)
```

- doc/cli.md: complete CLI reference with examples
- doc/direct-extractors.md: call format-specific extractors directly and work with concrete result attributes
- doc/format-matrix.md: per-format behavior, units, and caveats
- doc/improvements.md: roadmap and improvement ideas
- CONTRIBUTING.md: contributor workflow
- CHANGELOG.md: release history
Apache 2.0. See LICENSE.
This project is not affiliated with, endorsed by, or sponsored by Microsoft.