Reference Scanning Code in PowerShell

OOXML Text Scanning & XPath Guide

Overview: Common Structure

All OOXML files are ZIP containers: *.docx, *.xlsx, *.pptx.
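Because the package is an ordinary ZIP archive, its parts can be enumerated with the .NET `System.IO.Compression` classes. The following is a minimal sketch; `Get-OoxmlEntryNames` is a hypothetical helper name introduced here for illustration.

```powershell
# Sketch: list the parts (ZIP entries) inside an OOXML package.
# Get-OoxmlEntryNames is a hypothetical helper, not part of any library.
Add-Type -AssemblyName System.IO.Compression
Add-Type -AssemblyName System.IO.Compression.FileSystem

function Get-OoxmlEntryNames {
    param([Parameter(Mandatory)][string]$Path)

    # .NET does not follow PowerShell's current location, so resolve to a full path first.
    $full = (Resolve-Path -LiteralPath $Path).ProviderPath

    # OpenRead returns a read-only ZipArchive; dispose it to release the file handle.
    $zip = [System.IO.Compression.ZipFile]::OpenRead($full)
    try {
        # FullName is the part path inside the package.
        $zip.Entries | ForEach-Object { $_.FullName }
    }
    finally {
        $zip.Dispose()
    }
}
```

Usage, assuming a file named `report.docx` exists in the current directory: `Get-OoxmlEntryNames -Path .\report.docx`.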

What to Scan & Caveats

DOCX (Word)

Core Text Part Paths

Tags to Read

Representative XPath

w = http://schemas.openxmlformats.org/wordprocessingml/2006/main
a = http://schemas.openxmlformats.org/drawingml/2006/main
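As an illustration of how the `w` prefix is used, the sketch below selects each `w:p` paragraph and joins the `w:t` runs beneath it. The XML is a made-up sample rather than a real document, and the variable names are assumptions.

```powershell
# Sketch: paragraph-level text extraction from WordprocessingML.
# $docXml is an invented sample, not taken from a real file.
$docXml = @'
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>
'@

$doc = New-Object System.Xml.XmlDocument
$doc.LoadXml($docXml)

$docNs = New-Object System.Xml.XmlNamespaceManager($doc.NameTable)
$docNs.AddNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')

# One output string per w:p, built from the w:t runs below it.
$paragraphs = foreach ($p in $doc.SelectNodes('//w:p', $docNs)) {
    $runs = foreach ($t in $p.SelectNodes('.//w:t', $docNs)) { $t.InnerText }
    $runs -join ''
}

$paragraphs   # two strings: 'Hello, world' and 'Second paragraph'
```

Joining per paragraph rather than per run keeps words together when Word splits a sentence across several runs.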

Practice Tips

PPTX (PowerPoint)

Core Text Part Paths

Tags to Read

Representative XPath

p = http://schemas.openxmlformats.org/presentationml/2006/main
a = http://schemas.openxmlformats.org/drawingml/2006/main
c = http://schemas.openxmlformats.org/drawingml/2006/chart
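A comparable sketch for slides, assuming the text of interest sits in shape text bodies (`p:sp/p:txBody`) as DrawingML paragraphs. The slide XML is an invented, heavily trimmed sample.

```powershell
# Sketch: text from shape text bodies on a slide.
# $slideXml is an invented sample; real slides carry many more elements.
$slideXml = @'
<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
       xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <p:cSld><p:spTree>
    <p:sp><p:txBody>
      <a:p><a:r><a:t>Quarterly </a:t></a:r><a:r><a:t>Review</a:t></a:r></a:p>
      <a:p><a:r><a:t>Agenda</a:t></a:r></a:p>
    </p:txBody></p:sp>
  </p:spTree></p:cSld>
</p:sld>
'@

$slide = New-Object System.Xml.XmlDocument
$slide.LoadXml($slideXml)

$slideNs = New-Object System.Xml.XmlNamespaceManager($slide.NameTable)
$slideNs.AddNamespace('p', 'http://schemas.openxmlformats.org/presentationml/2006/main')
$slideNs.AddNamespace('a', 'http://schemas.openxmlformats.org/drawingml/2006/main')

# One string per a:p inside a shape's text body, joining its a:t runs.
$slideText = foreach ($para in $slide.SelectNodes('//p:sp/p:txBody/a:p', $slideNs)) {
    $runs = foreach ($t in $para.SelectNodes('.//a:t', $slideNs)) { $t.InnerText }
    $runs -join ''
}

$slideText   # two strings: 'Quarterly Review' and 'Agenda'
```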

Practice Tips

XLSX (Excel)

Core Text Part Paths

String Storage Rules

Tags to Read

Representative XPath

s = http://schemas.openxmlformats.org/spreadsheetml/2006/main
a = http://schemas.openxmlformats.org/drawingml/2006/main
c = http://schemas.openxmlformats.org/drawingml/2006/chart
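In SpreadsheetML, a cell marked `t="s"` stores an index into the shared string table rather than the text itself. The sketch below resolves such cells against invented samples of a shared string part and a worksheet part; the variable names are assumptions.

```powershell
# Sketch: resolve shared-string cells. Both XML samples are invented.
$sstXml = @'
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="2" uniqueCount="2">
  <si><t>Plain</t></si>
  <si><r><t>Rich </t></r><r><t>text</t></r></si>
</sst>
'@
$sheetXml = @'
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1"><c r="A1" t="s"><v>1</v></c><c r="B1"><v>42</v></c></row>
  </sheetData>
</worksheet>
'@
$ssUri = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'

$sst = New-Object System.Xml.XmlDocument
$sst.LoadXml($sstXml)
$sstNs = New-Object System.Xml.XmlNamespaceManager($sst.NameTable)
$sstNs.AddNamespace('s', $ssUri)

# Build the string table: plain s:t or rich-text s:r/s:t, joined per s:si.
$shared = @(foreach ($si in $sst.SelectNodes('//s:si', $sstNs)) {
    $parts = foreach ($t in $si.SelectNodes('./s:t | ./s:r/s:t', $sstNs)) { $t.InnerText }
    $parts -join ''
})

$sheet = New-Object System.Xml.XmlDocument
$sheet.LoadXml($sheetXml)
$sheetNs = New-Object System.Xml.XmlNamespaceManager($sheet.NameTable)
$sheetNs.AddNamespace('s', $ssUri)

# Only cells with t="s" point into the table; B1 holds a plain number and is skipped.
$cells = foreach ($c in $sheet.SelectNodes('//s:c[@t="s"]', $sheetNs)) {
    $index = [int]$c.SelectSingleNode('s:v', $sheetNs).InnerText
    [pscustomobject]@{ Ref = $c.GetAttribute('r'); Text = $shared[$index] }
}

$cells   # one object: Ref = A1, Text = 'Rich text'
```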

Practice Tips

Parser Tips (PowerShell/XPath)

Namespace Strategies

1) Namespace Manager (canonical)

# $xml is assumed to be an already loaded System.Xml.XmlDocument (one part read from the package)
$ns = New-Object System.Xml.XmlNamespaceManager($xml.NameTable)
$ns.AddNamespace('w','http://schemas.openxmlformats.org/wordprocessingml/2006/main')
$ns.AddNamespace('a','http://schemas.openxmlformats.org/drawingml/2006/main')
$ns.AddNamespace('s','http://schemas.openxmlformats.org/spreadsheetml/2006/main')

# Typical text nodes across parts
$nodes = $xml.SelectNodes('//w:t | //a:t | //s:si/s:t | //s:si/s:r/s:t', $ns)

2) local-name() + namespace-uri() (quick & robust)

# Word plain text
$xml.SelectNodes('//*[local-name()="t" and namespace-uri()="http://schemas.openxmlformats.org/wordprocessingml/2006/main"]')

# DrawingML text (Word/PPTX/Excel drawings)
$xml.SelectNodes('//*[local-name()="t" and namespace-uri()="http://schemas.openxmlformats.org/drawingml/2006/main"]')
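The snippets above assume `$xml` is already loaded. One possible way to get from a package file to text nodes is sketched below: it reads a single part from the ZIP into an `XmlDocument` and applies the `local-name()` + `namespace-uri()` filter. `Get-OoxmlPartText` and its parameter names are hypothetical.

```powershell
# Sketch: read one part from the package and return the text of its "t" elements.
# Get-OoxmlPartText is a hypothetical helper, not part of any library.
Add-Type -AssemblyName System.IO.Compression
Add-Type -AssemblyName System.IO.Compression.FileSystem

function Get-OoxmlPartText {
    param(
        [Parameter(Mandatory)][string]$Path,          # package file, e.g. a .docx
        [Parameter(Mandatory)][string]$PartName,      # entry name inside the ZIP
        [Parameter(Mandatory)][string]$NamespaceUri   # namespace of the "t" elements to read
    )

    $full = (Resolve-Path -LiteralPath $Path).ProviderPath
    $zip = [System.IO.Compression.ZipFile]::OpenRead($full)
    try {
        $entry = $zip.GetEntry($PartName)
        if ($null -eq $entry) { return }   # part not present in this package

        $xml = New-Object System.Xml.XmlDocument
        $stream = $entry.Open()
        try { $xml.Load($stream) }
        finally { $stream.Dispose() }

        # Same filter as strategy 2: match on local name and namespace URI, no manager needed.
        $xpath = '//*[local-name()="t" and namespace-uri()="{0}"]' -f $NamespaceUri
        foreach ($node in $xml.SelectNodes($xpath)) { $node.InnerText }
    }
    finally {
        $zip.Dispose()
    }
}
```

Usage, assuming a file named `report.docx` in the current directory: `Get-OoxmlPartText -Path .\report.docx -PartName 'word/document.xml' -NamespaceUri 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'`. Each matching run is returned as a separate string, so merging runs into paragraphs is left to the caller.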

Merging & Cleanup

High-Risk Auxiliary Parts

Minimum Scan Checklist

DOCX

PPTX

XLSX

Common Pitfalls