How to view metadata in MS Word files. How to remove and edit Word metadata
Metadata in Word files
MS Word files, and MS Office documents in general, contain a lot of metadata.
If you need to extract metadata from MS Word files without opening the file in Word, you can use special utilities.
The popular mat tool, which is used to display and clean up metadata, does not work very well with MS Word files:
mat -d file2.docx
Example output:
[+] File file2.docx :
Harmful metadata found:
    customXml/item1.xml's zipinfo: {'system': 'unknown'}
    docProps/core.xml's zipinfo: {'system': 'unknown'}
    docProps/app.xml's zipinfo: {'system': 'unknown'}
    word/document.xml's zipinfo: {'system': 'unknown'}
    [Content_Types].xml's zipinfo: {'system': 'unknown'}
    word/theme/theme1.xml's zipinfo: {'system': 'unknown'}
    customXml/itemProps1.xml's zipinfo: {'system': 'unknown'}
    _rels/.rels's zipinfo: {'system': 'unknown'}
    customXml/_rels/item1.xml.rels's zipinfo: {'system': 'unknown'}
    word/footnotes.xml's zipinfo: {'system': 'unknown'}
    word/header1.xml's zipinfo: {'system': 'unknown'}
    word/_rels/document.xml.rels's zipinfo: {'system': 'unknown'}
    word/webSettings.xml's zipinfo: {'system': 'unknown'}
    word/styles.xml's zipinfo: {'system': 'unknown'}
    docProps/core.xml: harmful content
    word/numbering.xml's zipinfo: {'system': 'unknown'}
    word/fontTable.xml's zipinfo: {'system': 'unknown'}
    word/endnotes.xml's zipinfo: {'system': 'unknown'}
    word/settings.xml's zipinfo: {'system': 'unknown'}
    docProps/app.xml: harmful content
Little of this is clear, and most of the output looks like noise. Two lines are useful, though:
docProps/core.xml: harmful content
docProps/app.xml: harmful content
These lines say that the docProps/core.xml and docProps/app.xml files contain potentially harmful content. But this program does not let us see the data itself.
If you try to analyze a .docm file (an MS Word document with macro support):
mat -d file2.docm
The program simply reports that it cannot process the file:
[-] Unable to process file2.docm
This is despite the fact that the .docm format differs only minimally from .docx: it contains a couple of additional files (one describing the macros and one holding the macros themselves).
There is also a newer version of the tool, mat2. Let's try it:
mat2 -s file2.docx
Example output:
[+] Metadata for file2.docx:
  [++] Metadata for [Content_Types].xml:
    create_system: Weird
  [++] Metadata for _rels/.rels:
    create_system: Weird
  [++] Metadata for customXml/_rels/item1.xml.rels:
    create_system: Weird
  [++] Metadata for customXml/item1.xml:
    create_system: Weird
  [++] Metadata for customXml/itemProps1.xml:
    create_system: Weird
  [++] Metadata for docProps/app.xml:
    AppVersion: 16.0000
    Application: Microsoft Office Word
    Characters: 275
    CharactersWithSpaces: 300
    DocSecurity: 0
    HeadingPairs: <vt:vector size="2" baseType="variant"><vt:variant><vt:lpstr>Название</vt:lpstr></vt:variant><vt:variant><vt:i4>1</vt:i4></vt:variant></vt:vector>
    HyperlinksChanged: false
    Lines: 76
    LinksUpToDate: false
    Pages: 6
    Paragraphs: 31
    ScaleCrop: false
    SharedDoc: false
    Template: Normal
    TitlesOfParts: <vt:vector size="1" baseType="lpstr"><vt:lpstr></vt:lpstr></vt:vector>
    TotalTime: 16
    Words: 50
    create_system: Weird
  [++] Metadata for docProps/core.xml:
    cp:lastModifiedBy: MiAl
    cp:lastPrinted: 2019-07-18T02:58:00Z
    cp:revision: 9
    create_system: Weird
    dc:creator: Alex
  [++] Metadata for word/_rels/document.xml.rels:
    create_system: Weird
  [++] Metadata for word/document.xml:
    create_system: Weird
  [++] Metadata for word/endnotes.xml:
    create_system: Weird
  [++] Metadata for word/fontTable.xml:
    create_system: Weird
  [++] Metadata for word/footnotes.xml:
    create_system: Weird
  [++] Metadata for word/header1.xml:
    create_system: Weird
  [++] Metadata for word/numbering.xml:
    create_system: Weird
  [++] Metadata for word/settings.xml:
    create_system: Weird
  [++] Metadata for word/styles.xml:
    create_system: Weird
  [++] Metadata for word/theme/theme1.xml:
    create_system: Weird
  [++] Metadata for word/webSettings.xml:
    create_system: Weird
This is much better: almost all of the file's metadata is displayed.
Let's try to analyze a .docm file:
mat2 -s file2.docm
Again, a failure:
[-] file2.docm's format (application/vnd.ms-word.document.macroenabled.12) is not supported
How to view metadata of a .docm file
The mat2 program is not aware that a .docm file has the same structure as a .docx file. But we know this and can take a very simple route: copy the file under a name that ends in .docx:
cp file2.docm file2.docm.docx
Now the metadata is perfectly extracted:
mat2 -s file2.docm.docx
How to make the output in mat2 more readable
You may notice that the output of the mat2 command mainly consists of the same lines:
create_system: Weird
The output will be much clearer if we simply remove these lines:
mat2 -s file2.docx | grep -v 'Weird'
What mat2 shows
The mat2 program displays the names of the XML nodes, most of which are self-explanatory. They are:
- AppVersion - version of the application
- Application - application
- Characters - total characters
- CharactersWithSpaces - total characters with spaces
- DocSecurity - document security
- HyperlinksChanged - links changed
- Lines - total lines in the document
- LinksUpToDate - links updated
- Pages - total pages in the document
- Paragraphs - total paragraphs in the document
- ScaleCrop - scaling/cropping
- SharedDoc - shared document
- Template - used template
- TitlesOfParts - part names
- TotalTime - total edit time
- Words - quantity of all words in the document
- cp:lastModifiedBy - who last modified the document
- cp:lastPrinted - the date the document was last printed
- cp:revision - total document revisions (number of edits and saves)
- dc:creator - who created the document
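The fields from docProps/app.xml in the list above can also be read without mat2. Below is a minimal sketch using only the Python standard library; the function name read_app_properties is my own, and the code assumes the file is a regular .docx (or .docm) that contains docProps/app.xml. Nested values such as HeadingPairs and TitlesOfParts are skipped to keep the result flat.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_app_properties(path):
    """Return the simple (text-only) properties from docProps/app.xml as a dict."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("docProps/app.xml"))
    props = {}
    for child in root:
        # Skip nested values such as HeadingPairs and TitlesOfParts.
        if len(child) == 0:
            # Tags look like "{namespace}Pages"; keep only the local name.
            name = child.tag.rsplit("}", 1)[-1]
            props[name] = child.text or ""
    return props
```

For the sample file used in this article, a call such as read_app_properties("file2.docx") should return entries like Pages, Words and TotalTime as strings.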
How to view MS Office document metadata without additional programs
In fact, the newer document formats, such as Word .docx files, are zip archives that contain mostly XML files (there may also be images, macros, and other binary files).
For manual analysis, I created a new file, file3.docx, and inserted a picture with GPS coordinates and other metadata. The mat and mat2 programs showed that the image is present, but they did not extract any metadata from it.
So, you can add the .zip extension to the file3.docx file and then unpack its contents as an archive.
I unpack the file:
unzip file3.docx.zip -d file3
After unpacking, the media files are located in the word/media/ folder.
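Renaming and unpacking is not strictly required just to see what is inside. As a rough sketch (Python standard library only; list_media_files is a name I made up for this example), the embedded media can be listed straight from the .docx:

```python
import zipfile


def list_media_files(path):
    """Return the names of archive members stored under word/media/."""
    with zipfile.ZipFile(path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )
```

The zipfile module looks at the file contents, not the extension, so the same call works on a .docx, a .docm or a renamed .zip copy.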
The mat2 program could not find any metadata in the image:
mat2 -s file3/word/media/image1.jpeg
No metadata found
The mat program also did not find anything:
mat -d file3/word/media/image1.jpeg
[+] File file3/word/media/image1.jpeg : No harmful metadata found
Apparently, when images are inserted into Word documents, the program re-saves them and all their metadata is lost. But at least we can simply open the image itself.
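The guess that the image lost its metadata can be double-checked by hand. The sketch below walks the header segments of a JPEG and reports whether an APP1 segment with an Exif block is present. It is my own simplified check, not how mat or mat2 work internally, and it looks only for Exif (not XMP or other metadata blocks).

```python
import struct


def jpeg_has_exif(data):
    """Return True if the JPEG bytes contain an APP1 segment holding Exif data."""
    if data[:2] != b"\xff\xd8":              # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break                            # malformed header; stop scanning
        marker = data[pos + 1]
        if marker == 0xFF:                   # padding byte before a marker
            pos += 1
            continue
        if marker in (0xDA, 0xD9):           # start of scan / end of image
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False
```

A typical use would be to read file3/word/media/image1.jpeg in binary mode and pass the bytes to jpeg_has_exif; a False result is consistent with what mat and mat2 reported.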
Document metadata is contained in the docProps/core.xml and docProps/app.xml files. I opened them in NetBeans IDE and, for readability, I chose the option of formatting the document, because in its initial form the entire document is written in one line that is difficult to read.
docProps/core.xml file:
docProps/app.xml file:
The core.xml file contains creation and modification dates that even the mat2 program does not display. Perhaps there are other fields that cannot be seen except by opening these files.
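To see those dates without an IDE, docProps/core.xml can be parsed directly. The following is a minimal sketch with the Python standard library; the function name and the list of fields are my own choice, and the namespace URIs are the ones normally declared at the top of core.xml. Fields that are absent from a given document are simply left out of the result.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces normally declared in docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

FIELDS = [
    "dc:creator", "cp:lastModifiedBy", "cp:revision",
    "cp:lastPrinted", "dcterms:created", "dcterms:modified",
]


def read_core_properties(path):
    """Return selected fields of docProps/core.xml as a dict of strings."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    result = {}
    for field in FIELDS:
        node = root.find(field, NS)   # direct child of the root element
        if node is not None:
            result[field] = node.text or ""
    return result
```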
Which file contains macros in Word files
Macro information is saved in the /word/vbaData.xml file, and the macros themselves are saved in /word/vbaProject.bin – this file is binary.
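Knowing the file name gives a quick way to check whether a document carries macros at all. The sketch below (my own helper, Python standard library only) just tests for word/vbaProject.bin in the archive. It assumes the default part name mentioned above, so treat a negative result as a hint, not as proof.

```python
import zipfile


def has_vba_macros(path):
    """Return True if the archive contains word/vbaProject.bin."""
    with zipfile.ZipFile(path) as archive:
        return "word/vbaProject.bin" in archive.namelist()
```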
Which file contains the text of the document in Word files
The text of the document is saved to the /word/document.xml file. This document uses special markup based on opening and closing tags and their properties.
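As an illustration of that markup, the visible text sits in w:t elements, which are grouped into paragraphs (w:p). Here is a rough sketch that pulls the plain text out of word/document.xml using the Python standard library. The function name is hypothetical, and the code ignores tabs, line breaks, footnotes, headers and paragraphs nested inside text boxes, so it is only a starting point.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, written in ElementTree's "{uri}" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(path):
    """Return the text of each paragraph in word/document.xml as a list of strings."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # A paragraph is split into runs; join the text of all of them.
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        paragraphs.append(text)
    return paragraphs
```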
How to clear MS Word file metadata
You can clear the metadata of MS Office documents, including Word documents, right in the editing program itself. The following example uses Word.
In the menu, click File:
Next, on the Info tab, find the Check for Issues button and select Inspect Document from the drop-down menu:
If the document is not saved, you will be asked to save it before the analysis.
Click the Inspect button:
Pay attention to the item Document Properties and Personal Information. If you wish, click the Remove All button:
Remove Office File Metadata on Linux
The mat program seems to have successfully removed metadata from the file:
mat file3.docx
This is indicated by the output:
[*] Cleaning file3.docx [+] file3.docx cleaned!
But the resulting file cannot be opened in any program…
The mat2 program successfully coped with the task and deleted all the metadata:
mat2 file3.docx
Please note that the original file is left unchanged: mat2 writes the cleaned copy to a new file (file3.cleaned.docx).
So, if you really need to remove the metadata of the .docx file without opening it in Word, the sequence of actions is as follows:
- Add a .zip extension to the file.
- Unpack the received archive.
- Open the docProps/core.xml and docProps/app.xml files and replace the data with the values you need. After editing, save these files.
- Select all the unpacked directories and files and pack them into a zip archive.
- Change the extension of the resulting archive to .docx.
- Check that the document is not damaged and still works as expected. So that the check itself does not save new metadata into the document, make a copy of the new document and open the copy.
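The unpack, edit and repack steps above can also be done in one pass without touching the disk in between. The following is a minimal sketch with the Python standard library; rewrite_parts and FIXED_DATE are names I introduced for this example. One addition that is not part of the manual steps: the sketch resets the timestamp of every archive entry, because zip entries store their own modification times.

```python
import zipfile

# Earliest timestamp the ZIP format can store; used so that entry dates
# do not reveal when the archive was repacked (an addition of this sketch).
FIXED_DATE = (1980, 1, 1, 0, 0, 0)


def rewrite_parts(src_path, dst_path, replacements):
    """Copy a .docx to dst_path, swapping the members named in `replacements`.

    `replacements` maps member names (for example "docProps/core.xml")
    to the new content as bytes. All other members are copied unchanged.
    """
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():            # keeps the original member order
            if info.filename in replacements:
                data = replacements[info.filename]
            else:
                data = src.read(info.filename)
            clean_info = zipfile.ZipInfo(info.filename, date_time=FIXED_DATE)
            dst.writestr(clean_info, data, compress_type=zipfile.ZIP_DEFLATED)
```

Writing to a separate dst_path keeps the original intact, which matches the advice to verify a copy before relying on the result.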
By the way, in this way you can not only delete but also spoof the metadata of office documents:
Note the creation, modification, printing dates and revision number:
In PHP, you can use the ZipArchive class with the default compression method to unpack and repack MS Word .docx files.
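The editing step itself can be scripted as well. Below is a sketch in Python (not PHP) that overwrites chosen fields in the contents of docProps/core.xml and returns the new XML. The function name set_core_fields is hypothetical, the code needs Python 3.8 or newer for the xml_declaration argument, and it only changes fields that already exist in the file. Note that ET.register_namespace changes a process-wide setting; it is used here so the output keeps the usual cp, dc, dcterms and xsi prefixes, which matters because attribute values such as xsi:type="dcterms:W3CDTF" refer to the dcterms prefix by name.

```python
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}


def set_core_fields(core_xml, values):
    """Return core.xml bytes with the given fields overwritten.

    `values` maps prefixed names such as "dcterms:created" to the new text.
    Fields that are not present in the document are skipped, not created.
    """
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)   # keep readable prefixes on output
    root = ET.fromstring(core_xml)
    for field, text in values.items():
        node = root.find(field, NS)
        if node is not None:
            node.text = text
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)
```

For example, passing {"dc:creator": "Nobody", "cp:revision": "1"} would replace the author name and the revision counter while leaving the rest of the file as it was.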
Conclusion
Metadata can contain important information, including the name of the document's author, so it deserves special attention.
As for displaying and cleaning metadata in MS Office documents with tools like mat and mat2: the first does not show the metadata and breaks the file when cleaning it, while the second shows the metadata and cleans the file successfully.
The easiest way to clean up the metadata in a Word document and in other office programs is to do it right in the corresponding MS Office editor.
In the next article we will extract, delete and spoof the metadata of the LibreOffice file formats.