DocxSmith - The Document Forge

A powerful and elegant Go library and CLI tool for manipulating .docx and .pdf files

Features

DOCX Support

Create new .docx documents from scratch
Read and parse existing .docx files
Modify document content programmatically
Add paragraphs with rich formatting (bold, italic, colors, sizes)
Delete paragraphs or ranges of content
Find and replace text throughout documents
Tables support (create, modify, delete)
Images support (add, insert, resize)
Headers & Footers support (default, first page, even page)
Extract text content from documents

PDF Support ✨ NEW

Create new PDF documents from scratch
Read and parse existing PDF files
Add text content with styling (bold, italic, colors, sizes)
Extract text from PDFs
Tables support in PDF generation
Metadata management (title, author, subject)

Format Conversion

Convert DOCX to PDF with formatting preservation
Convert PDF to DOCX for editing
External tool support (LibreOffice, Pandoc) for production-quality conversion
Built-in converters as fallback for simple documents

Additional Features

CLI tool for command-line operations
Scalable architecture for easy extension
Well-tested with comprehensive test coverage

Installation

As a Library

go get github.com/Palaciodiego008/docxsmith

As a CLI Tool

go install github.com/Palaciodiego008/docxsmith/cmd/docxsmith@latest

Or build from source:

git clone https://github.com/Palaciodiego008/docxsmith.git
cd docxsmith
go build -o docxsmith ./cmd/docxsmith

PDF Conversion Setup

DocxSmith supports three conversion modes:

LibreOffice (Recommended) - Best quality, handles complex formatting
Pandoc - Fast, good for simple documents
Built-in - Fallback for basic conversions (limited formatting)

The tool automatically detects and uses the best available converter.

Installing External Tools (Recommended)

LibreOffice (Best Quality):

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install libreoffice-writer

# macOS
brew install libreoffice

# Arch Linux
sudo pacman -S libreoffice-fresh

Pandoc (Fast Alternative):

# Ubuntu/Debian
sudo apt-get install pandoc

# macOS
brew install pandoc

# Arch Linux
sudo pacman -S pandoc

Check Installation:

# Verify tools are available (on some systems the LibreOffice binary is named soffice)
which libreoffice || which soffice
which pandoc

# Or use the system check script
./check_system.sh

Conversion Quality Comparison

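| Method      | DOCX→PDF   | PDF→DOCX | Large Files | Complex Formatting |
|-------------|------------|----------|-------------|--------------------|
| LibreOffice | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ✅          | ✅                 |
| Pandoc      | ⭐⭐⭐⭐   | ⭐⭐⭐   | ✅          | ⚠️                 |
| Built-in    | ⭐⭐       | ⭐       | ❌          | ❌                 |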

Troubleshooting Conversions

"Process killed" error:

Install LibreOffice for better memory handling
Or reduce file size before conversion

PDF to DOCX produces an empty file:

The PDF may consist of scanned images with no text layer.

Install an OCR tool and add a text layer before converting:

sudo apt-get install ocrmypdf
ocrmypdf input.pdf output.pdf
./docxsmith convert -input output.pdf -output document.docx

"libreoffice not found" but it's installed:

Add to PATH (macOS):

export PATH="/Applications/LibreOffice.app/Contents/MacOS:$PATH"

Quick Start

Using as a Library

package main

import (
    "log"
    "github.com/Palaciodiego008/docxsmith/pkg/docx"
)

func main() {
    // Create a new document
    doc := docx.New()

    // Add content
    doc.AddParagraph("Welcome to DocxSmith!")
    doc.AddParagraph("This is bold text", docx.WithBold())
    doc.AddParagraph("This is colored text", docx.WithColor("FF0000"))

    // Add headers and footers
    doc.SetHeader(docx.HeaderTypeDefault, "Company Name", docx.WithHFBold(), docx.WithHFAlignment("center"))
    doc.SetFooter(docx.FooterTypeDefault, "Page {PAGE}", docx.WithHFAlignment("center"))

    // Save the document
    if err := doc.Save("output.docx"); err != nil {
        log.Fatal(err)
    }
}

Using the CLI

DOCX Operations

# Create a new document
docxsmith create -output hello.docx -text "Hello, World!"

# Add content to an existing document
docxsmith add -input hello.docx -output hello2.docx -text "New paragraph" -bold

# Find text in a document
docxsmith find -input hello.docx -text "World"

# Replace text
docxsmith replace -input hello.docx -output hello3.docx -old "World" -new "DocxSmith"

# Extract text
docxsmith extract -input hello.docx

# Create a table
docxsmith table -input hello.docx -output table.docx -create -rows 3 -cols 4

# Add an image
docxsmith image add -input hello.docx -output hello_img.docx -image photo.jpg -width 300 -height 200

# Insert image at specific position
docxsmith image insert -input hello.docx -output hello_img.docx -image logo.png -at 0 -width 150

# Count images in document
docxsmith image count -input document.docx

# Add a header, then a footer (the second command reads the first command's output so the header is kept)
docxsmith header-footer set-header -input hello.docx -output hello_hf.docx -content "Company Header" -bold -align center
docxsmith header-footer set-footer -input hello_hf.docx -output hello_hf2.docx -content "Page {PAGE}" -align center

# List headers and footers
docxsmith header-footer list -input document.docx

PDF Operations ✨

# Create a new PDF
docxsmith pdf-create -output hello.pdf -text "Hello PDF!" -title "My Document"

# Add content to a PDF
docxsmith pdf-add -input hello.pdf -output hello2.pdf -text "New content" -bold -size 14

# Extract text from PDF
docxsmith pdf-extract -input document.pdf

# Get PDF information
docxsmith pdf-info -input document.pdf

Format Conversion

# Convert DOCX to PDF
docxsmith convert -input document.docx -output document.pdf

# Convert PDF to DOCX
docxsmith convert -input document.pdf -output document.docx

# Convert with custom options
docxsmith convert -input doc.docx -output doc.pdf -font-size 14 -font-family "Times"

Library API

Creating Documents

// Create a new empty document
doc := docx.New()

// Create from an existing template
doc, err := docx.CreateFromTemplate("template.docx")

// Open an existing document
doc, err := docx.Open("existing.docx")

Working with Paragraphs

// Add a simple paragraph
doc.AddParagraph("Simple text")

// Add with formatting
doc.AddParagraph("Bold text", docx.WithBold())
doc.AddParagraph("Italic text", docx.WithItalic())
doc.AddParagraph("Colored text", docx.WithColor("0000FF"))
doc.AddParagraph("Large text", docx.WithSize("32")) // sizes are in half-points: "32" is 16pt
doc.AddParagraph("Centered text", docx.WithAlignment("center"))

// Combine multiple options
doc.AddParagraph("Fancy text",
    docx.WithBold(),
    docx.WithItalic(),
    docx.WithColor("FF0000"),
    docx.WithSize("28"))

// Add paragraph at specific position
doc.AddParagraphAt(2, "Inserted text")

// Delete a paragraph
doc.DeleteParagraph(0)

// Delete a range of paragraphs
doc.DeleteParagraphsRange(0, 5)

Text Operations

// Find text in document
indices := doc.FindText("search term")
// Returns slice of paragraph indices where text was found

// Replace all occurrences
count := doc.ReplaceText("old", "new")

// Replace in specific paragraph
doc.ReplaceTextInParagraph(2, "old", "new")

// Get all text content
text := doc.GetText()

// Get text from specific paragraph
text, err := doc.GetParagraphText(0)
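
To make the return values above concrete, here is a stand-alone sketch that models a document as a plain slice of paragraph strings. It is not the library's implementation: `findText` and `replaceText` are hypothetical helpers, and whether DocxSmith matches case-sensitively or counts occurrences rather than paragraphs is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// findText returns the indices of the paragraphs that contain term.
func findText(paragraphs []string, term string) []int {
	var indices []int
	for i, p := range paragraphs {
		if strings.Contains(p, term) {
			indices = append(indices, i)
		}
	}
	return indices
}

// replaceText replaces every occurrence of old in place and returns the
// number of occurrences replaced. An empty search string replaces nothing.
func replaceText(paragraphs []string, old, replacement string) int {
	if old == "" {
		return 0
	}
	count := 0
	for i, p := range paragraphs {
		if n := strings.Count(p, old); n > 0 {
			paragraphs[i] = strings.ReplaceAll(p, old, replacement)
			count += n
		}
	}
	return count
}

func main() {
	paragraphs := []string{"Hello, World!", "Nothing here", "World tour"}
	fmt.Println("found in paragraphs:", findText(paragraphs, "World"))
	fmt.Println("replacements:", replaceText(paragraphs, "World", "DocxSmith"))
	fmt.Println(paragraphs)
}
```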

Working with Headers and Footers

// Set headers
doc.SetHeader(docx.HeaderTypeDefault, "Company Name", docx.WithHFBold(), docx.WithHFAlignment("center"))
doc.SetHeader(docx.HeaderTypeFirst, "DRAFT", docx.WithHFItalic(), docx.WithHFTextColor("FF0000"))
doc.SetHeader(docx.HeaderTypeEven, "Even Page Header", docx.WithHFAlignment("left"))

// Set footers
doc.SetFooter(docx.FooterTypeDefault, "Page {PAGE} of {NUMPAGES}", docx.WithHFAlignment("center"))
doc.SetFooter(docx.FooterTypeFirst, "© 2024 Company", docx.WithHFAlignment("center"))

// Check if headers/footers exist
hasHeader := doc.HasHeader(docx.HeaderTypeDefault)
hasFooter := doc.HasFooter(docx.FooterTypeDefault)

// Get headers/footers
header, err := doc.GetHeader(docx.HeaderTypeDefault)
footer, err := doc.GetFooter(docx.FooterTypeDefault)

// Remove headers/footers
doc.RemoveHeader(docx.HeaderTypeFirst)
doc.RemoveFooter(docx.FooterTypeFirst)

// Header/Footer types available:
// HeaderTypeDefault, HeaderTypeFirst, HeaderTypeEven
// FooterTypeDefault, FooterTypeFirst, FooterTypeEven

// Formatting options:
// WithHFBold(), WithHFItalic()
// WithHFAlignment("center"), WithHFFontSize("24")
// WithHFTextColor("FF0000"), WithHFFont("Arial")
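
The three header types map onto pages roughly as sketched below. This is a simplified model of how word processors usually apply default, first-page, and even-page headers, not code from DocxSmith; the `headers` struct and `headerForPage` function are hypothetical, and an empty string stands for "not set".

```go
package main

import "fmt"

// headers holds the text for each header type; "" means not set.
type headers struct {
	Default string
	First   string
	Even    string
}

// headerForPage picks the header shown on a 1-based page number:
// the first-page header wins on page 1, the even-page header on even
// pages, and the default header everywhere else.
func headerForPage(h headers, page int) string {
	switch {
	case page == 1 && h.First != "":
		return h.First
	case page%2 == 0 && h.Even != "":
		return h.Even
	default:
		return h.Default
	}
}

func main() {
	h := headers{Default: "Company Name", First: "DRAFT", Even: "Even Page Header"}
	for page := 1; page <= 4; page++ {
		fmt.Printf("page %d: %s\n", page, headerForPage(h, page))
	}
}
```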

Working with Images

// Add an image with default size (200x150)
err := doc.AddImage("photo.jpg")

// Add image with custom dimensions
err := doc.AddImage("logo.png", 
    docx.WithImageWidth(300), 
    docx.WithImageHeight(200))

// Insert image at specific paragraph position
err := doc.AddImageAt(2, "banner.png", 
    docx.WithImageWidth(400), 
    docx.WithImageHeight(100))

// Get number of images in document
imageCount := doc.GetImageCount()

// Supported formats: PNG, JPEG, GIF, BMP

Working with Tables

// Create a table
table := doc.AddTable(3, 4) // 3 rows, 4 columns

// Set cell content
table.SetCellText(0, 0, "Header 1")
table.SetCellText(0, 1, "Header 2")

// Get cell content
text, err := table.GetCellText(1, 1)

// Add a row
table.AddRow()

// Delete a row
table.DeleteRow(1)

// Get table dimensions
rows := table.GetRowCount()
cols := table.GetColumnCount()

// Delete entire table
doc.DeleteTable(0)

Document Information

// Get counts
paraCount := doc.GetParagraphCount()
tableCount := doc.GetTableCount()

// Clear all content
doc.Clear()

// Clone document
newDoc := doc.Clone()

Saving Documents

// Save to file
err := doc.Save("output.docx")

// Save to a different file
err := doc.SaveAs("copy.docx")

// Get document as bytes
data, err := doc.ToBytes()

PDF Library API ✨

Creating PDF Documents

import "github.com/Palaciodiego008/docxsmith/pkg/pdf"

// Create a new PDF
pdfDoc := pdf.New()

// Set metadata
pdfDoc.SetMetadata("My Document", "Author Name", "Subject")

// Add a page
page := pdfDoc.AddPage()

// Add text
page.AddText("Hello PDF", 20, 30, 12)

// Add styled text
style := pdf.TextStyle{
    FontSize:   14,
    FontFamily: "Arial",
    Bold:       true,
    Italic:     false,
    Color:      "FF0000", // Red
}
page.AddTextStyled("Important Text", 20, 50, style)

// Save
pdfDoc.Save("output.pdf")
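
Colors such as "FF0000" are six hex digits without a leading #. The following sketch shows one way to validate such a string and split it into red, green, and blue components; `parseHexColor` is a hypothetical helper for illustration and is not part of the pdf package.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseHexColor parses an "RRGGBB" string (no leading '#') into its
// red, green, and blue components.
func parseHexColor(s string) (uint8, uint8, uint8, error) {
	if len(s) != 6 {
		return 0, 0, 0, fmt.Errorf("color %q: want 6 hex digits without '#'", s)
	}
	v, err := strconv.ParseUint(s, 16, 24)
	if err != nil {
		return 0, 0, 0, fmt.Errorf("color %q: %w", s, err)
	}
	// Each component occupies one byte of the 24-bit value.
	return uint8(v >> 16), uint8(v >> 8), uint8(v), nil
}

func main() {
	r, g, b, err := parseHexColor("FF0000")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("r=%d g=%d b=%d\n", r, g, b)
}
```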

Reading PDF Documents

// Open existing PDF
pdfDoc, err := pdf.Open("document.pdf")

// Get page count
pageCount := pdfDoc.GetPageCount()

// Extract all text
text := pdfDoc.GetAllText()

// Get specific page
page, err := pdfDoc.GetPage(0)
pageText := page.GetText()

Converting Between Formats

import (
    "log"

    "github.com/Palaciodiego008/docxsmith/pkg/converter"
)

// Convert DOCX to PDF
opts := converter.DefaultOptions()
opts.FontSize = 12
opts.FontFamily = "Arial"

if err := converter.ConvertDocxToPDF("input.docx", "output.pdf", opts); err != nil {
    log.Fatal(err)
}

// Convert PDF to DOCX
if err := converter.ConvertPDFToDocx("input.pdf", "output.docx", opts); err != nil {
    log.Fatal(err)
}

CLI Commands

create - Create a new document

docxsmith create -output file.docx [-text "content"]

Options:

-output: Output file path (required)
-text: Initial text content (optional)

add - Add content

docxsmith add -input in.docx -output out.docx -text "content" [options]

Options:

-input: Input file path (required)
-output: Output file path (required)
-text: Text to add (required)
-at: Insert at specific index (optional)
-bold: Make text bold
-italic: Make text italic
-size: Font size in half-points (e.g., "24" for 12pt)
-color: Text color as a hex RGB value without the leading # (e.g., FF0000)
-align: Alignment (left, center, right, both; "both" justifies the text)
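
Because sizes are given in half-points ("24" for 12pt), a small conversion helper can avoid off-by-a-factor-of-two mistakes. The sketch below is illustrative only; `halfPoints` and `points` are hypothetical names, not functions provided by DocxSmith.

```go
package main

import (
	"fmt"
	"strconv"
)

// halfPoints converts a font size in points to the half-point string
// expected by size options, rounding to the nearest half-point.
func halfPoints(pt float64) string {
	return strconv.Itoa(int(pt*2 + 0.5))
}

// points converts a half-point string back to a size in points.
func points(halfPt string) (float64, error) {
	n, err := strconv.Atoi(halfPt)
	if err != nil {
		return 0, fmt.Errorf("invalid half-point size %q: %w", halfPt, err)
	}
	return float64(n) / 2, nil
}

func main() {
	fmt.Println("12pt as half-points:", halfPoints(12))
	pt, err := points("32")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("\"32\" in points:", pt)
}
```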

delete - Delete content

docxsmith delete -input in.docx -output out.docx [options]

Options:

-input: Input file path (required)
-output: Output file path (required)
-paragraph: Paragraph index to delete
-start & -end: Delete range of paragraphs
-table: Table index to delete

replace - Replace text

docxsmith replace -input in.docx -output out.docx -old "text" -new "replacement"

Options:

-input: Input file path (required)
-output: Output file path (required)
-old: Text to replace (required)
-new: Replacement text (required)
-paragraph: Only replace in specific paragraph

find - Find text

docxsmith find -input file.docx -text "search"

Options:

-input: Input file path (required)
-text: Text to find (required)

extract - Extract text

docxsmith extract -input file.docx [-output text.txt]

Options:

-input: Input file path (required)
-output: Output text file (optional, prints to stdout if omitted)

table - Table operations

docxsmith table -input in.docx -output out.docx [options]

Options:

-input: Input file path (required)
-output: Output file path (required)
-create: Create a new table
-rows: Number of rows (default: 2)
-cols: Number of columns (default: 2)
-set: Set cell text (format: "tableIdx,row,col,text")
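
The "tableIdx,row,col,text" format can be parsed by splitting on the first three commas only, which leaves any commas in the cell text intact. The sketch below shows that approach; `cellUpdate` and `parseSet` are hypothetical names, and whether the real CLI accepts commas inside the text is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cellUpdate is the parsed form of a "tableIdx,row,col,text" value.
type cellUpdate struct {
	Table, Row, Col int
	Text            string
}

// parseSet splits s on the first three commas, so the text part may
// itself contain commas. Indices must be non-negative integers.
func parseSet(s string) (cellUpdate, error) {
	parts := strings.SplitN(s, ",", 4)
	if len(parts) != 4 {
		return cellUpdate{}, fmt.Errorf("want \"tableIdx,row,col,text\", got %q", s)
	}
	var idx [3]int
	for i := 0; i < 3; i++ {
		n, err := strconv.Atoi(strings.TrimSpace(parts[i]))
		if err != nil || n < 0 {
			return cellUpdate{}, fmt.Errorf("invalid index %q in %q", parts[i], s)
		}
		idx[i] = n
	}
	return cellUpdate{Table: idx[0], Row: idx[1], Col: idx[2], Text: parts[3]}, nil
}

func main() {
	u, err := parseSet("0,1,2,Hello, world")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", u)
}
```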

info - Document information

docxsmith info -input file.docx

Options:

-input: Input file path (required)

clear - Clear all content

docxsmith clear -input in.docx -output out.docx

Options:

-input: Input file path (required)
-output: Output file path (required)

Examples

See the examples directory for more comprehensive examples:

# Run the basic usage example
cd examples
go run basic_usage.go

This will generate several example documents demonstrating various features.

Testing

Run the test suite:

go test ./...

Run tests with coverage:

go test -cover ./...

Run tests with verbose output:

go test -v ./pkg/docx

Project Structure

docxsmith/
├── cmd/
│   └── docxsmith/          # CLI entry point
│       └── main.go         # Minimal main function
├── internal/
│   └── cli/                # CLI command implementations
│       ├── cli.go          # CLI router and usage
│       ├── create.go       # Create command
│       ├── content.go      # Add, delete, clear commands
│       ├── text.go         # Find, replace, extract commands
│       ├── table.go        # Table operations
│       └── info.go         # Info command
├── pkg/
│   ├── docx/               # Core DOCX library (public API)
│   │   ├── document.go     # Document structure
│   │   ├── reader.go       # Reading .docx files
│   │   ├── writer.go       # Writing .docx files
│   │   ├── operations.go   # Document operations
│   │   ├── table.go        # Table operations
│   │   ├── creator.go      # Document creation
│   │   └── *_test.go       # Tests
│   ├── pdf/                # PDF library (public API)
│   └── converter/          # DOCX/PDF format conversion
├── examples/               # Usage examples
├── testdata/               # Test fixtures
├── go.mod
└── README.md

How It Works

.docx files are actually ZIP archives containing XML files. DocxSmith:

Unzips the .docx file
Parses the XML content (mainly word/document.xml)
Manipulates the XML structure
Serializes back to XML
Repackages as a ZIP file with .docx extension

The library handles all the complexity of the Office Open XML format while providing a simple, intuitive API.
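
The unzip-and-parse steps above can be reproduced with only Go's standard library. The sketch below is not DocxSmith's code: it builds a minimal in-memory archive containing just word/document.xml (a real .docx has additional parts) and reads the paragraph text back. The helper names `buildSampleDocx`, `extractParagraphs`, and `parseParagraphs` are hypothetical.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wNS is the WordprocessingML main namespace.
const wNS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

// buildSampleDocx zips a minimal word/document.xml with one run per paragraph.
func buildSampleDocx(paragraphs ...string) ([]byte, error) {
	var body strings.Builder
	body.WriteString(`<w:document xmlns:w="` + wNS + `"><w:body>`)
	for _, p := range paragraphs {
		body.WriteString("<w:p><w:r><w:t>")
		if err := xml.EscapeText(&body, []byte(p)); err != nil {
			return nil, err
		}
		body.WriteString("</w:t></w:r></w:p>")
	}
	body.WriteString("</w:body></w:document>")

	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	f, err := zw.Create("word/document.xml")
	if err != nil {
		return nil, err
	}
	if _, err := io.WriteString(f, body.String()); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// extractParagraphs opens the archive and parses word/document.xml.
func extractParagraphs(docx []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(docx), int64(len(docx)))
	if err != nil {
		return nil, err
	}
	for _, f := range zr.File {
		if f.Name != "word/document.xml" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		defer rc.Close()
		return parseParagraphs(rc)
	}
	return nil, fmt.Errorf("word/document.xml not found")
}

// parseParagraphs collects the text of all <w:t> elements inside each <w:p>.
func parseParagraphs(r io.Reader) ([]string, error) {
	dec := xml.NewDecoder(r)
	var paras []string
	var cur strings.Builder
	inText := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return paras, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Space == wNS && t.Name.Local == "p" {
				cur.Reset()
			}
			inText = t.Name.Space == wNS && t.Name.Local == "t"
		case xml.EndElement:
			if t.Name.Space == wNS && t.Name.Local == "p" {
				paras = append(paras, cur.String())
			}
			inText = false
		case xml.CharData:
			if inText {
				cur.Write(t)
			}
		}
	}
}

func main() {
	data, err := buildSampleDocx("Welcome to DocxSmith!", "Second paragraph")
	if err != nil {
		fmt.Println("build error:", err)
		return
	}
	paras, err := extractParagraphs(data)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	for i, p := range paras {
		fmt.Printf("%d: %s\n", i, p)
	}
}
```

Writing a document back follows the same path in reverse: serialize the modified XML and write it into a new ZIP archive alongside the other parts.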

Limitations

Currently focuses on document content (paragraphs, tables, images, headers/footers)
Advanced features like charts and complex shapes are not yet supported
Complex formatting and styles have limited support
Does not preserve all metadata from original documents

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

MIT License - feel free to use this project for any purpose.

Author

Diego Palacio (@Palaciodiego008)

Acknowledgments

Built with Go's standard library
Inspired by the need for simple .docx manipulation
Name inspired by blacksmiths who forge powerful tools

DocxSmith - Forging documents with precision and elegance.
