sharepoint-to-text

sharepoint-to-text is a typed, pure-Python library for extracting text and structured content from file types commonly found in SharePoint and document-management workflows.

It is built for software engineers who need one extraction interface across modern Microsoft Office files, legacy Office files, OpenDocument, PDF, email, HTML-like content, plain-text formats, and archives.

Why This Package Exists

Document ingestion pipelines usually fail in one of two ways:

they only support a narrow set of office formats
they require heavyweight external runtimes such as LibreOffice, Java, or platform-specific tooling

sharepoint-to-text takes a different approach:

Pure Python library API and CLI
One routing layer for many file types
Works with file paths and in-memory bytes
Typed extraction objects with metadata, units, images, and tables
Suitable for indexing, RAG, ETL, compliance review, and migration tooling

At A Glance

| Item | Details |
| --- | --- |
| Python | `>=3.10` |
| Install | `uv add sharepoint-to-text` |
| Runtime model | Pure Python |
| Primary interfaces | `read_file(...)`, `read_bytes(...)`, `read_many(...)`, CLI |
| Output model | Generator of typed extraction objects |
| SharePoint access | Optional `sharepoint_io` helper for Graph-backed listing/download |

Who This Is For

This package is a good fit if you need to:

normalize text extraction across many enterprise document formats
process documents from disk, APIs, queues, or object storage
preserve some document structure for downstream chunking or citations
run extraction in Python-only environments such as services, workers, or serverless jobs

It is not a full document rendering engine, OCR system, or layout-preserving conversion tool.

Installation

Package install

uv add sharepoint-to-text

With pip:

pip install sharepoint-to-text

Development install

git clone https://github.com/Horsmann/sharepoint-to-text.git
cd sharepoint-to-text
uv sync --all-groups

Quick Start

Extract from a local file

import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(result.get_full_text())

Extract from in-memory bytes

import sharepoint2text

payload = b"hello from memory"
result = next(sharepoint2text.read_bytes(payload, extension="txt"))
print(result.get_full_text())

Use structural units for chunking

import sharepoint2text

result = next(sharepoint2text.read_file("report.pdf", ignore_images=True))

for unit in result.iterate_units():
    print(unit.get_text())
    print(unit.get_metadata())

Batch extraction from a folder

import sharepoint2text

# Extract only Word and PDF files from a folder
for result in sharepoint2text.read_many("docs", suffixes=[".docx", ".pdf"]):
    print(result.get_metadata().file_path)
    print(result.get_full_text()[:200])

Extract all supported files from a folder

import sharepoint2text

# Extract all supported file formats recursively
for result in sharepoint2text.read_many("docs", extract_all_supported=True):
    for unit in result.iterate_units(ignore_images=True):
        text = unit.get_text().strip()
        if text:
            print(result.get_metadata().file_path, text[:120])

Core API

Main entry points

import sharepoint2text

sharepoint2text.read_file(
    path,
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
)

sharepoint2text.read_bytes(
    data,
    extension="pdf",      # or ".pdf"
    mime_type=None,        # for example "application/pdf"
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
)

sharepoint2text.read_many(
    folder_path,
    suffixes=[".docx", ".pdf"],  # list of extensions to extract
    extract_all_supported=False,  # or True to extract all supported formats
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
    recursive=True,               # traverse subdirectories
)

sharepoint2text.is_supported_file(path)
sharepoint2text.get_extractor(path)

Batch extraction with `read_many`

The read_many function extracts content from multiple files in a folder:

| Parameter | Description |
| --- | --- |
| `folder_path` | Path to the folder to traverse |
| `suffixes` | List of file extensions to extract (e.g., `[".docx", ".pdf"]`) |
| `extract_all_supported` | If `True`, extract all supported formats (mutually exclusive with `suffixes`) |
| `recursive` | If `True` (default), traverse subdirectories |

Configuration rules:

You must specify either suffixes or extract_all_supported=True
Specifying both raises InvalidConfigurationError
Suffixes are normalized, so they can be given with or without a leading dot
Extraction continues when a file fails; a warning is logged for each failed file
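
The configuration rules above can be sketched in a few lines. This is an illustration only, not the library's implementation: `normalize_suffixes` and `validate_selection` are hypothetical helpers, the local `InvalidConfigurationError` is a stand-in for the library's exception, and lowercasing suffixes is an assumption.

```python
from typing import Iterable, List, Optional


class InvalidConfigurationError(ValueError):
    """Local stand-in for the library's exception of the same name."""


def normalize_suffixes(suffixes: Iterable[str]) -> List[str]:
    """Return each suffix with exactly one leading dot (lowercasing is an assumption)."""
    normalized = []
    for suffix in suffixes:
        cleaned = suffix.strip().lower().lstrip(".")
        if cleaned:
            normalized.append("." + cleaned)
    return normalized


def validate_selection(
    suffixes: Optional[Iterable[str]], extract_all_supported: bool
) -> Optional[List[str]]:
    """Apply the either/or rule; return normalized suffixes, or None for 'all'."""
    if suffixes is not None and extract_all_supported:
        raise InvalidConfigurationError(
            "pass suffixes or extract_all_supported=True, not both"
        )
    if suffixes is None and not extract_all_supported:
        raise InvalidConfigurationError(
            "pass suffixes or extract_all_supported=True"
        )
    return None if extract_all_supported else normalize_suffixes(suffixes)


print(normalize_suffixes(["docx", ".PDF"]))  # ['.docx', '.pdf']
```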

Result model

All extracted results implement a common interface:

get_full_text()
iterate_units()
iterate_images()
iterate_tables()
get_metadata()
to_json() / from_json(...)

Use get_full_text() when you want one string per extraction result.

Use iterate_units() when you want coarse structural chunks such as:

one page per PDF unit
one slide per presentation unit
one sheet per spreadsheet unit
one document-level unit for most text-document formats

Generator semantics matter

The API returns generators because some inputs can produce multiple results:

archives can yield one result per supported member file
.mbox can yield one result per email
email extraction can recursively expose supported attachments

For single-document formats, next(...) is usually the simplest call pattern.
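
One detail of this pattern is plain Python behavior: a bare `next(...)` raises `StopIteration` if a generator yields nothing, while `next(..., None)` returns a default instead. The sketch below shows both the single-result and the multi-result call pattern. `fake_read` is a stand-in generator, not a library function, and `first_or_none` is a hypothetical helper.

```python
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def first_or_none(results: Iterable[T]) -> Optional[T]:
    """Return the first result, or None if the generator yields nothing."""
    return next(iter(results), None)


def fake_read(names: List[str]) -> Iterator[str]:
    """Stand-in generator: yields one 'result' per member file or email."""
    for name in names:
        yield f"text of {name}"


# Single-document input: take the first (and only) result.
single = first_or_none(fake_read(["document.docx"]))

# Multi-result input such as an archive or .mbox: consume the whole generator.
many = list(fake_read(["a.eml", "b.eml"]))

# Empty generator: no StopIteration, just None.
empty = first_or_none(fake_read([]))
```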

CLI

The package installs a sharepoint2text command.

Single file extraction

Plain text output:

sharepoint2text --file /path/to/file.docx

Full extraction objects as JSON:

sharepoint2text --file /path/to/file.docx --json

Per-unit JSON:

sharepoint2text --file /path/to/file.pdf --json-unit

Folder extraction

Extract all supported files from a folder:

sharepoint2text --folder /path/to/folder

Extract only specific file types:

sharepoint2text --folder /path/to/folder --suffixes .docx,.pdf,.txt

Non-recursive (top-level only):

sharepoint2text --folder /path/to/folder --no-recursive

Folder output (mirrored structure)

When extracting from a folder, output to another folder to preserve the directory structure:

# Write each file separately to output folder
sharepoint2text --folder /input/docs --output /output/extracted/

# The output structure mirrors the input:
# /input/docs/report.docx      -> /output/extracted/report.txt
# /input/docs/sub/data.xlsx    -> /output/extracted/sub/data.txt

Output path behavior:

If --output is an existing directory, files are written separately
If --output is a new path without extension, it's created as a directory
If --output has a file extension, all results are combined into that file
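
The three rules and the mirrored layout can be expressed as a short sketch. It is not taken from the CLI source: `output_mode` and `mirrored_path` are hypothetical helpers, and the `.txt` suffix follows the plain-text example above (the suffix used for JSON output is not covered here).

```python
from pathlib import Path, PurePosixPath


def output_mode(output: str) -> str:
    """Return 'separate' (one file per input) or 'combined' (single file)."""
    path = Path(output)
    if path.is_dir():
        return "separate"  # existing directory
    if path.suffix == "":
        return "separate"  # new path without extension: created as a directory
    return "combined"  # path with a file extension


def mirrored_path(input_root: str, source_file: str, output_root: str) -> PurePosixPath:
    """Map an input file to its output location, keeping the subfolder layout."""
    relative = PurePosixPath(source_file).relative_to(PurePosixPath(input_root))
    return PurePosixPath(output_root) / relative.with_suffix(".txt")


print(mirrored_path("/input/docs", "/input/docs/sub/data.xlsx", "/output/extracted"))
# /output/extracted/sub/data.txt
```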

CLI options

| Option | Description |
| --- | --- |
| `--file FILE`, `-f FILE` | Path to a single file to extract |
| `--folder FOLDER`, `-d FOLDER` | Path to a folder to extract files from (recursive by default) |
| `--suffixes SUFFIXES`, `-s SUFFIXES` | Comma-separated file suffixes to filter (e.g., `.docx,.pdf`). Only with `--folder`. If omitted, extracts all supported types. |
| `--no-recursive` | Only extract top-level files (no subdirectories). Only with `--folder`. |
| `--output PATH`, `-o PATH` | Output path: file (combined) or folder (separate files mirroring input structure) |
| `--json`, `-j` | Emit `list[extraction_object]` |
| `--json-unit`, `-u` | Emit `list[unit_object]` |
| `--include-images`, `-i` | Include base64 image payloads in JSON output |
| `--no-attachments`, `-n` | Skip expanding supported email attachments |
| `--max-file-size-mb`, `-m` | Maximum input size in MiB, default `100`, use `0` to disable |
| `--version`, `-v` | Print CLI version |

Important CLI rules:

--file and --folder are mutually exclusive (one is required)
--suffixes and --no-recursive only work with --folder
--json and --json-unit are mutually exclusive
--include-images requires --json or --json-unit
the CLI enforces the same file-size guard as the Python API
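
If you wrap the CLI or build a similar front end, the flag rules above map naturally onto `argparse`. The following is a sketch under that assumption, not the package's actual parser. It covers only a subset of the options, and the program name `sharepoint2text-sketch` is made up.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sharepoint2text-sketch")

    # --file and --folder are mutually exclusive, and one is required.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--file", "-f")
    source.add_argument("--folder", "-d")

    parser.add_argument("--suffixes", "-s")
    parser.add_argument("--no-recursive", action="store_true")

    # --json and --json-unit are mutually exclusive.
    fmt = parser.add_mutually_exclusive_group()
    fmt.add_argument("--json", "-j", action="store_true")
    fmt.add_argument("--json-unit", "-u", action="store_true")

    parser.add_argument("--include-images", "-i", action="store_true")
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Cross-option rules that argparse groups cannot express directly.
    if (args.suffixes or args.no_recursive) and not args.folder:
        parser.error("--suffixes and --no-recursive only work with --folder")
    if args.include_images and not (args.json or args.json_unit):
        parser.error("--include-images requires --json or --json-unit")
    return args


args = parse(["--folder", "docs", "--suffixes", ".docx,.pdf", "--json"])
print(args.folder, args.suffixes, args.json)  # docs .docx,.pdf True
```

Invalid combinations end in `parser.error(...)`, which prints a usage message and exits with status 2.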

Supported Formats

Microsoft Office

Modern: .docx, .docm, .xlsx, .xlsm, .xlsb, .pptx, .pptm
Legacy: .doc, .dot, .xls, .xlt, .ppt, .pot, .pps, .rtf
Alias mapping: .dotx, .dotm, .xltx, .xltm, .potx, .potm, .ppsx, .ppsm

OpenDocument

.odt, .ods, .odp, .odg, .odf
Alias mapping: .ott, .ots, .otp

Email

.eml, .msg, .mbox
.eml and .msg can parse and expose supported attachments
.mbox yields one result per message

Plain text and data-like formats

.txt, .md, .csv, .tsv, .json
.yaml, .yml, .xml, .log, .ini, .cfg, .conf, .properties

Web and ebook

.html, .htm, .mhtml, .mht, .epub

PDF

.pdf

SharePoint Integration

The extraction library works independently of SharePoint. The optional sharepoint_io module is a separate helper layer for listing and downloading files through Microsoft Graph before extraction.

import io

import sharepoint2text
from sharepoint2text.sharepoint_io import (
    EntraIDAppCredentials,
    SharePointRestClient,
)

credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)

client = SharePointRestClient(
    site_url="https://contoso.sharepoint.com/sites/Documents",
    credentials=credentials,
)

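for file_meta in client.list_all_files():
    # Skip files the library cannot route, before downloading them.
    if not sharepoint2text.is_supported_file(file_meta.name):
        continue
    data = client.download_file(file_meta.id)
    # Select the extractor for this file name and run it on the downloaded bytes.
    extractor = sharepoint2text.get_extractor(file_meta.name)
    for result in extractor(io.BytesIO(data), path=file_meta.name):
        print(result.get_full_text()[:200])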

Setup details live in sharepoint2text/sharepoint_io/SETUP.md.

Operational Constraints

These are the points an engineering team usually needs before adopting the package:

No OCR: scanned-image PDFs will often produce little or no text
No external office renderer: output is extraction-oriented, not fidelity-oriented
Word-like formats do not expose reliable page boundaries
Nested archives are intentionally skipped
Password-protected or encrypted inputs raise extraction errors
Large files and highly compressed archives are guarded by size limits and zip-bomb protections

Archive behavior

archives are processed one level deep
supported non-archive files inside an archive can yield extraction results
nested archives are skipped as a safety measure
7z extraction is capped at 100 MB internally
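
To make the archive rules concrete, here is a minimal sketch of a one-level zip guard built on the standard-library `zipfile` module. It illustrates the general technique only and is not the library's code: the `safe_members` helper, the `ARCHIVE_SUFFIXES` list, the ratio threshold, and the `ArchiveTooLargeError` stand-in are all assumptions.

```python
import io
import zipfile
from typing import List

# Hypothetical list of suffixes treated as nested archives in this sketch.
ARCHIVE_SUFFIXES = (".zip", ".7z", ".tar", ".gz")


class ArchiveTooLargeError(Exception):
    """Local stand-in; the library raises its own ExtractionZipBombError."""


def safe_members(data: bytes, max_total_bytes: int, max_ratio: float = 100.0) -> List[str]:
    """Return names of members that are safe to extract, one level deep."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        infos = [info for info in archive.infolist() if not info.is_dir()]
        total = sum(info.file_size for info in infos)
        if total > max_total_bytes:
            raise ArchiveTooLargeError(f"declared size {total} exceeds {max_total_bytes}")
        names = []
        for info in infos:
            # A very high uncompressed/compressed ratio is a zip-bomb signal.
            if info.compress_size and info.file_size / info.compress_size > max_ratio:
                raise ArchiveTooLargeError(f"suspicious ratio in {info.filename}")
            if info.filename.lower().endswith(ARCHIVE_SUFFIXES):
                continue  # nested archive: skipped, never opened
            names.append(info.filename)
        return names


def build_demo_zip() -> bytes:
    """Build a small in-memory zip with one text file and one 'nested archive'."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("notes.txt", "hello")
        archive.writestr("inner.zip", b"not really a zip")
    return buffer.getvalue()


print(safe_members(build_demo_zip(), max_total_bytes=1024))  # ['notes.txt']
```

The sizes checked here are the ones declared in the archive headers, which can be forged. A production guard would also count bytes while actually reading each member.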

Performance guidance

set ignore_images=True when image payloads are not needed
use iterate_units() for chunk-wise downstream processing instead of materializing one large string when structure matters
keep size limits enabled unless you trust the input source

Failure Modes and Exceptions

Common exceptions:

ExtractionFileFormatNotSupportedError
ExtractionFileEncryptedError
ExtractionFileTooLargeError
ExtractionLegacyMicrosoftParsingError
ExtractionZipBombError
ExtractionPathTraversalError
ExtractionFailedError
InvalidConfigurationError (for read_many with conflicting options)
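
In a service, it often helps to translate these exceptions into short reason codes for logs or metrics. The sketch below matches on class names because the import path of the exception classes is not shown in this README. The reason codes and the `failure_reason` helper are hypothetical, and the demo exception class is a local stand-in.

```python
from typing import Dict

# Exception names are taken from the list above; the codes are made up.
REASON_BY_EXCEPTION: Dict[str, str] = {
    "ExtractionFileFormatNotSupportedError": "unsupported_format",
    "ExtractionFileEncryptedError": "encrypted",
    "ExtractionFileTooLargeError": "too_large",
    "ExtractionLegacyMicrosoftParsingError": "legacy_parse_error",
    "ExtractionZipBombError": "zip_bomb",
    "ExtractionPathTraversalError": "path_traversal",
    "ExtractionFailedError": "extraction_failed",
    "InvalidConfigurationError": "invalid_configuration",
}


def failure_reason(exc: BaseException) -> str:
    """Map an exception to a reason code, most specific class first."""
    for cls in type(exc).__mro__:
        reason = REASON_BY_EXCEPTION.get(cls.__name__)
        if reason:
            return reason
    return "unexpected"


# Local stand-in so the sketch runs without the library installed.
class ExtractionFileEncryptedError(Exception):
    pass


print(failure_reason(ExtractionFileEncryptedError()))  # encrypted
print(failure_reason(ValueError("boom")))              # unexpected
```

With the library installed, catching the concrete exception classes directly is the more conventional choice; see doc/troubleshooting.md for where they live.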

If you are integrating this into a service, see doc/integration-guide.md and doc/troubleshooting.md.

Serialization

import json

import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
payload = result.to_json()

print(json.dumps(payload))

Restore from JSON:

from sharepoint2text.parsing.extractors.data_types import ExtractionInterface

restored = ExtractionInterface.from_json(payload)
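
For batch jobs, one convenient way to persist many payloads is JSON Lines, with one `to_json()` payload per line. The helpers below are a sketch using only the standard library. `write_jsonl` and `read_jsonl` are hypothetical names, and the demo dictionaries stand in for real payloads, whose keys are not specified here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator


def write_jsonl(payloads: Iterable[Dict[str, Any]], path: Path) -> int:
    """Write one JSON document per line; return the number of lines written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for payload in payloads:
            handle.write(json.dumps(payload, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one payload per non-empty line."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Demo with placeholder payloads (real ones would come from result.to_json()).
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results.jsonl"
    written = write_jsonl([{"text": "first"}, {"text": "second"}], target)
    restored_payloads = list(read_jsonl(target))
```

Each restored payload can then be passed to `ExtractionInterface.from_json(...)` as shown above.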

Additional Documentation

doc/cli.md: complete CLI reference with examples
doc/direct-extractors.md: call format-specific extractors directly and work with concrete result attributes
doc/format-matrix.md: per-format behavior, units, and caveats
doc/improvements.md: roadmap and improvement ideas
CONTRIBUTING.md: contributor workflow
CHANGELOG.md: release history

License

Apache 2.0. See LICENSE.

Disclaimer

This project is not affiliated with, endorsed by, or sponsored by Microsoft.
