PyMuPDF Documentation
Release 1.19.3
Jorj X. McKie
1 Introduction
1.1 Note on the Name fitz
1.2 License and Copyright
1.3 Covered Version
2 Installation
2.1 Step 1: Install MuPDF
2.2 Step 2: Download and Generate PyMuPDF
2.3 Enabling Integrated OCR Support
3 Tutorial
3.1 Importing the Bindings
3.2 Opening a Document
3.3 Some Document Methods and Attributes
3.4 Accessing Meta Data
3.5 Working with Outlines
3.6 Working with Pages
3.6.1 Inspecting the Links, Annotations or Form Fields of a Page
3.6.2 Rendering a Page
3.6.3 Saving the Page Image in a File
3.6.4 Displaying the Image in GUIs
3.6.4.1 wxPython
3.6.4.2 Tkinter
3.6.4.3 PyQt4, PyQt5, PySide
3.6.5 Extracting Text and Images
3.6.6 Searching for Text
3.7 PDF Maintenance
3.7.1 Modifying, Creating, Re-arranging and Deleting Pages
3.7.2 Joining and Splitting PDF Documents
3.7.3 Embedding Data
3.7.4 Saving
3.8 Closing
3.9 Further Reading
4 Collection of Recipes
4.1 Images
4.1.1 How to Make Images from Document Pages
4.1.2 How to Increase Image Resolution
4.1.3 How to Create Partial Pixmaps (Clips)
4.1.4 How to Zoom a Clip to a GUI Window
4.1.5 How to Create or Suppress Annotation Images
4.1.6 How to Extract Images: Non-PDF Documents
4.1.7 How to Extract Images: PDF Documents
4.1.8 How to Handle Image Masks
4.1.9 How to Make one PDF of all your Pictures (or Files)
4.1.10 How to Create Vector Images
4.1.11 How to Convert Images
4.1.12 How to Use Pixmaps: Glueing Images
4.1.13 How to Use Pixmaps: Making a Fractal
4.1.14 How to Interface with NumPy
4.1.15 How to Add Images to a PDF Page
4.1.16 How to Control the Size of Inserted Images
4.2 Text
4.2.1 How to Extract all Document Text
4.2.2 How to Extract Text from within a Rectangle
4.2.3 How to Extract Text in Natural Reading Order
4.2.4 How to Extract Tables from Documents
4.2.5 How to Mark Extracted Text
4.2.6 How to Mark Searched Text
4.2.7 How to Mark Non-horizontal Text
4.2.8 How to Analyze Font Characteristics
4.2.9 How to Insert Text
4.2.9.1 How to Write Text Lines
4.2.9.2 How to Fill a Text Box
4.2.9.3 How to Use Non-Standard Encoding
4.3 Annotations
4.3.1 How to Add and Modify Annotations
4.3.2 How to Use FreeText
4.3.3 Using Buttons and JavaScript
4.3.4 How to Use Ink Annotations
4.4 Drawing and Graphics
4.5 Extracting Drawings
4.6 Multiprocessing
4.7 General
4.7.1 How to Open with a Wrong File Extension
4.7.2 How to Embed or Attach Files
4.7.3 How to Delete and Re-Arrange Pages
4.7.4 How to Join PDFs
4.7.5 How to Add Pages
4.7.6 How To Dynamically Clean Up Corrupt PDFs
4.7.7 How to Split Single Pages
4.7.8 How to Combine Single Pages
4.7.9 How to Convert Any Document to PDF
4.7.10 How to Deal with Messages Issued by MuPDF
4.7.11 How to Deal with PDF Encryption
4.8 Common Issues and their Solutions
4.8.1 Changing Annotations: Unexpected Behaviour
4.8.1.1 Problem
4.8.1.2 Cause
4.8.1.3 Solutions
4.8.2 Misplaced Item Insertions on PDF Pages
4.8.2.1 Problem
4.8.2.2 Cause
4.8.2.3 Solutions
4.8.3 Missing or Unreadable Extracted Text
4.8.3.1 Problem: no text is extracted
4.8.3.2 Cause
4.8.3.3 Solution
4.8.3.4 Problem: unreadable text
4.8.3.5 Cause
4.8.3.6 Solution
4.9 Low-Level Interfaces
4.9.1 How to Iterate through the xref Table
4.9.2 How to Handle Object Streams
4.9.3 How to Handle Page Contents
4.9.4 How to Access the PDF Catalog
4.9.5 How to Access the PDF File Trailer
4.9.6 How to Access XML Metadata
4.9.7 How to Extend PDF Metadata
4.9.8 How to Read and Update PDF Objects
4.10 Journalling
4.10.1 Example Session 1
4.10.2 Example Session 2
5 Module fitz
5.1 Invocation
5.2 Cleaning and Copying
5.3 Extracting Fonts and Images
5.4 Joining PDF Documents
5.5 Low Level Information
5.6 Embedded Files Commands
5.6.1 Information
5.6.2 Extraction
5.6.3 Deletion
5.6.4 Insertion
5.6.5 Updates
5.6.6 Copying
5.7 Text Extraction
6 Classes
6.1 Annot
6.1.1 Annotation Icons in MuPDF
6.1.2 Example
6.2 Colorspace
6.3 DisplayList
6.4 Document
6.4.1 set_metadata() Example
6.4.2 set_toc() Demonstration
6.4.3 insert_pdf() Examples
6.4.4 Other Examples
6.5 Font
6.6 Identity
6.7 IRect
6.8 Link
6.9 linkDest
6.10 Matrix
6.10.1 Examples
6.10.2 Shifting
6.10.3 Flipping
6.10.4 Shearing
6.10.5 Rotating
6.11 Outline
6.12 Page
6.12.1 Modifying Pages
6.12.2 Description of get_links() Entries
6.12.3 Notes on Supporting Links
6.12.3.1 Reading (pertains to method get_links() and the first_link property chain)
6.12.3.2 Writing
6.12.4 Homologous Methods of Document and Page
6.13 Pixmap
6.13.1 Supported Input Image Formats
6.13.2 Supported Output Image Formats
6.14 Point
6.15 Quad
6.15.1 Remark
6.15.2 Containment Checks
6.16 Rect
6.17 Shape
6.17.1 Usage
6.17.2 Examples
6.17.3 Common Parameters
6.18 TextPage
6.18.1 Structure of Dictionary Outputs
6.18.1.1 Page Dictionary
6.18.1.2 Block Dictionaries
6.18.1.3 Line Dictionary
6.18.1.4 Span Dictionary
6.18.1.5 Character Dictionary for extractRAWDICT()
6.19 TextWriter
6.20 Tools
6.20.1 Example Session
6.21 Widget
6.21.1 Standard Fonts for Widgets
6.21.2 Supported Widget Types
8.3.3 Perform Text Search
8.3.4 Extract Text
8.3.5 Further Performance improvements
8.3.5.1 Pixmap
8.3.5.2 TextPage
9 Glossary
15.2 PDF Base 14 Fonts
15.3 Adobe PDF References
15.4 Using Python Sequences as Arguments in PyMuPDF
15.5 Ensuring Consistency of Important Objects in PyMuPDF
15.6 Design of Method Page.show_pdf_page()
15.6.1 Purpose and Capabilities
15.6.2 Technical Implementation
15.7 Redirecting Error and Warning Messages
Index
CHAPTER 1
Introduction
PyMuPDF is a Python binding for MuPDF, a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which
is maintained and developed by Artifex Software, Inc.
MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top
performance and high rendering quality.
MuPDF stands out among all similar products for its top rendering capability and unsurpassed processing speed. At
the same time, its “light weight” makes it an excellent choice for platforms where resources are typically limited, like
smartphones.
Check this out yourself and compare the various free PDF viewers. In terms of speed and rendering quality,
SumatraPDF ranks at the top (apart from MuPDF’s own standalone viewer), ever since it changed its library basis to
MuPDF!
With PyMuPDF you can access files with extensions like “.pdf”, “.xps”, “.oxps”, “.cbz”, “.fb2” or “.epub”. In addition,
about 10 popular image formats can also be opened and handled like documents.
PyMuPDF provides access to many important functions of MuPDF from within a Python environment, and we are
continuously seeking to expand this function set.
PyMuPDF runs and has been tested on Mac, Linux and Windows for Python versions 3.6 and up. Other platforms
should work too, as long as MuPDF and Python support them.
PyMuPDF is hosted on GitHub and registered on PyPI.
For MS Windows, Mac OSX and Linux Python wheels are available – please see the installation chapter.
The GitHub repository PyMuPDF-Utilities contains a full range of examples, demonstrations and use cases.
PyMuPDF Documentation, Release 1.19.3
The top level Python import name for this library is “fitz”. This has historical reasons:
The original rendering library for MuPDF was called Libart.
“After Artifex Software acquired the MuPDF project, the development focus shifted on writing a new modern graphics
library called “Fitz”. Fitz was originally intended as an R&D project to replace the aging Ghostscript graphics
library, but has instead become the rendering engine powering MuPDF.” (Quoted from Wikipedia).
So PyMuPDF cannot coexist with packages named “fitz” in the same Python environment.
In order to comply with MuPDF’s dual licensing model, PyMuPDF has entered into an agreement with Artifex, who
has the right to sublicense PyMuPDF to third parties.
PyMuPDF and MuPDF are now available under both open-source AGPL and commercial license agreements. Please
read the full text of the AGPL license agreement, available in the distribution material (file COPYING),
to ensure that your use case complies with the guidelines of the license. If you determine you cannot meet the
requirements of the AGPL, please contact Artifex for more information regarding a commercial license.
Artifex is the exclusive commercial licensing agent for MuPDF.
Artifex, the Artifex logo, MuPDF, and the MuPDF logo are registered trademarks of Artifex Software Inc. © 2021
Artifex Software, Inc. All rights reserved.
Note: The major and minor versions of PyMuPDF and MuPDF will always be the same. Only the third qualifier
(patch level) may deviate from that of MuPDF.
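The version rule in this note can be sketched as a small check. This is a minimal illustration with a hypothetical helper name, same_major_minor; the sample strings are made up in the style of this release. In a live session, the two strings could presumably be taken from fitz.VersionBind and fitz.VersionFitz instead.

```python
def same_major_minor(binding_version: str, library_version: str) -> bool:
    """Return True if both version strings share major and minor numbers."""
    # Compare only the first two dot-separated fields; the patch level may differ.
    return binding_version.split(".")[:2] == library_version.split(".")[:2]


# Sample values only (assumption): a binding and a library version string.
print(same_major_minor("1.19.3", "1.19.0"))
print(same_major_minor("1.19.3", "1.18.0"))
```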
CHAPTER 2
Installation
PyMuPDF can be installed from Python wheels for Windows (32bit and 64bit), Linux (64bit, Intel and ARM) and
Mac OSX (64bit, Intel), Python versions 3.6 and up.
PyMuPDF does not support Python versions prior to 3.6. Some older wheels can be found here. Please note that we
generally follow the official Python release schedules. For Python versions dropping out of official support, this means
that the generation of wheels will eventually cease.
There are no mandatory external dependencies. However, some optional features are available only if additional
components are installed:
• Pillow is required for Pixmap.pil_save() and Pixmap.pil_tobytes().
• fontTools is required for Document.subset_fonts().
• pymupdf-fonts is a collection of nice fonts to be used for text output methods.
• Tesseract-OCR for optical character recognition in images and document pages. Tesseract is separate software,
not a Python package. To enable OCR functions in PyMuPDF, the system environment variable
"TESSDATA_PREFIX" must be defined and contain the tessdata folder name of the Tesseract installation
location.
Note: You can install these additional components at any time – before or after installing PyMuPDF. PyMuPDF will
detect their presence during import or when the respective functions are being used.
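To see which of the optional Python packages are present in an environment, a check along the following lines may help. It is only a sketch: the import names are assumptions (Pillow as PIL, fontTools as fontTools, pymupdf-fonts as pymupdf_fonts), and optional_components is a hypothetical helper, not part of PyMuPDF.

```python
import importlib.util

# Assumed import names for the optional Python packages listed above.
OPTIONAL_MODULES = {
    "Pillow": "PIL",
    "fontTools": "fontTools",
    "pymupdf-fonts": "pymupdf_fonts",
}


def optional_components() -> dict:
    """Map each optional component to True if its module can be found."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in OPTIONAL_MODULES.items()
    }


for name, present in optional_components().items():
    print(f"{name}: {'installed' if present else 'not installed'}")
```

Tesseract is not covered by this check, because it is separate software rather than a Python package.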
{
    "include_dirs": ["folder1", "folder2", "folder3", ...],
    "library_dirs": ["folder1", "folder2", "folder3", ...],
}
Note: You can also install from the sources in the GitHub repository. These do not contain the pre-generated files
fitz.py or fitz_wrap.c, which instead are generated by the installation script setup.py. To use setup.py, SWIG
must be installed on your system.
If you do not intend to use the integrated OCR support, this step can be skipped. Otherwise, it is required for both
installation paths: from wheels and from sources.
PyMuPDF contains all the logic to support OCR functions. Tesseract, however, is not a Python package, but separate
software that must be installed on the system.
To use it, (Py-) MuPDF needs to be told the location of Tesseract’s language support folder. This is currently done
by storing that folder name in the environment variable "TESSDATA_PREFIX".
On Unix-like systems, a typical way to define this variable is:
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
Caution: Setting this environment variable must happen outside Python, before starting your script. Manipulating
os.environ will not work!
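If the variable cannot be set in the shell, one possible workaround is a small launcher that starts the actual script in a child process whose environment already contains the variable. The sketch below only demonstrates the mechanism: run_with_tessdata is a hypothetical helper, the child merely echoes the variable, and the folder is the example path used above.

```python
import os
import subprocess
import sys


def run_with_tessdata(tessdata_dir: str, code: str) -> str:
    """Run Python code in a child process that starts with TESSDATA_PREFIX set."""
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, TESSDATA_PREFIX=tessdata_dir)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# The child only echoes the variable; a real script would import fitz there
# and call its OCR functions.
child_code = "import os; print(os.environ['TESSDATA_PREFIX'])"
print(run_with_tessdata("/usr/share/tesseract-ocr/4.00/tessdata", child_code))
```

The child sees the variable from the moment it starts, which is the condition the caution asks for. The parent process itself is unchanged.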
CHAPTER 3
Tutorial
This tutorial will show you the use of PyMuPDF, MuPDF in Python, step by step.
Because MuPDF supports not only PDF, but also XPS, OpenXPS, CBZ, CBR, FB2 and EPUB formats, so does
PyMuPDF [1]. Nevertheless, for the sake of brevity we will only talk about PDF files. Where indeed only PDF
files are supported, this will be mentioned explicitly.
The Python bindings to MuPDF are made available by the statement import fitz. Your version can be checked by
printing fitz.__doc__.
Calling doc = fitz.open(filename) creates the Document object doc. filename must be a Python string (or a
pathlib.Path) specifying the name of an existing file.
It is also possible to open a document from memory data, or to create a new, empty PDF. See Document for details.
You can also use Document as a context manager.
[1] PyMuPDF also lets you open several image file types just like normal documents. See section Supported Input Image Formats.
A document contains many attributes and functions. Among them are meta information (like “author” or “subject”),
the total number of pages, outline and encryption information.
PyMuPDF fully supports standard metadata. Document.metadata is a Python dictionary with the following keys.
It is available for all document types, though not all entries may always contain data. For details of their meanings
and formats consult the respective manuals, e.g. Adobe PDF References for PDF. Further information can also be
found in chapter Document. The meta data fields are strings or None if not otherwise indicated. Also be aware that
not all of them always contain meaningful data – even if they are not None.
Key            Value
producer       producer (producing software)
format         format: ‘PDF-1.4’, ‘EPUB’, etc.
encryption     encryption method used if any
author         author
modDate        date of last modification
keywords       keywords
title          title
creationDate   date of creation
creator        creating application
subject        subject
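Because entries may be None or empty, it can be useful to filter the dictionary before displaying it. The following sketch works on a hand-written sample in the shape of Document.metadata; all values are made up, and meaningful_metadata is a hypothetical helper.

```python
def meaningful_metadata(metadata: dict) -> dict:
    """Keep only entries whose value is neither None nor an empty string."""
    return {key: value for key, value in metadata.items() if value}


# Hand-written sample with the keys from the table above (values are made up).
sample = {
    "format": "PDF-1.4",
    "title": "",
    "author": "A. Writer",
    "subject": None,
    "keywords": "",
    "creator": None,
    "producer": "some PDF library",
    "creationDate": "D:20211201120000",
    "modDate": None,
    "encryption": None,
}

for key, value in meaningful_metadata(sample).items():
    print(f"{key}: {value}")
```

With a real document, doc.metadata would be passed instead of sample.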
Note: Apart from these standard metadata, PDF documents starting from PDF version 1.4 may also contain so-
called “metadata streams” (see also stream). Information in such streams is coded in XML. PyMuPDF deliberately
contains no XML components, so we do not directly support access to information contained therein. But you can
extract the stream as a whole, inspect or modify it using a package like lxml and then store the result back into the
PDF. If you want, you can also delete these data altogether.
Note: There are two utility scripts in the repository that import metadata from CSV files (PDF only) or export
metadata to CSV files.
The easiest way to get all outlines (also called “bookmarks”) of a document is by loading its table of contents:
toc = doc.get_toc()
This will return a Python list of lists [[lvl, title, page, ...], ...] which looks much like a conventional table of contents
found in books.
lvl is the hierarchy level of the entry (starting from 1), title is the entry’s title, and page the page number (1-based!).
Other parameters describe details of the bookmark target.
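The list structure lends itself to simple processing. As a small sketch, the hypothetical helper format_toc below prints an indented outline from a hand-written sample in the format described above; with a real document, the result of doc.get_toc() would be passed instead.

```python
def format_toc(toc: list) -> list:
    """Turn [[lvl, title, page, ...], ...] into indented text lines."""
    lines = []
    for lvl, title, page, *_ in toc:
        indent = "  " * (lvl - 1)   # level 1 entries are not indented
        lines.append(f"{indent}{title} (page {page})")
    return lines


# Made-up sample entries: hierarchy level, title, 1-based page number.
sample_toc = [[1, "Introduction", 1], [2, "License", 2], [1, "Tutorial", 5]]
print("\n".join(format_toc(sample_toc)))
```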
Note: There are two utility scripts in the repository that import a table of contents from CSV files (PDF only) or
export it to CSV files.
A page is addressed by its 0-based number pno, for example as doc[pno]. Any integer -∞ < pno < page_count is
possible. Negative numbers count backwards from the end, so doc[-1] is the last page, like with Python sequences.
A more advanced way is to use the document as an iterator over its pages (for page in doc).
Once you have your page, here is what you would typically do with it:
Links are shown as “hot areas” when a document is displayed with some viewer software. If you click while your
cursor shows a hand symbol, you will usually be taken to the target that is encoded in that hot area. All links of a
page can be retrieved with Page.get_links().
If dealing with a PDF document page, there may also exist annotations (Annot) or form fields (Widget), each of which
has its own iterator (Page.annots() and Page.widgets()).
pix = page.get_pixmap()
pix is a Pixmap object which (in this case) contains an RGB image of the page, ready to be used for many purposes.
Method Page.get_pixmap() offers lots of variations for controlling the image: resolution / DPI, colorspace
(e.g. to produce a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring,
shifting, shearing, etc. For example: to create an RGBA image (i.e. containing an alpha channel), specify pix =
page.get_pixmap(alpha=True).
A Pixmap contains a number of methods and attributes which are referenced below. Among them are the integers
width, height (each in pixels) and stride (number of bytes of one horizontal image line). Attribute samples represents
a rectangular area of bytes representing the image data (a Python bytes object).
Note: You can also create a vector image of a page by using Page.get_svg_image(). Refer to this Wiki for
details.
pix.save("page-%i.png" % page.number)
We can also use it in GUI dialog managers. Pixmap.samples represents an area of bytes of all the pixels as a
Python bytes object. Here are some examples; find more in the examples directory.
3.6.4.1 wxPython
Consult their documentation for adjustments to RGB(A) pixmaps and, potentially, specifics for your wxPython release:
import wx

if pix.alpha:  # RGBA samples
    bitmap = wx.Bitmap.FromBufferRGBA(pix.width, pix.height, pix.samples)
else:  # RGB samples
    bitmap = wx.Bitmap.FromBuffer(pix.width, pix.height, pix.samples)
3.6.4.2 Tkinter
If you are looking for a complete Tkinter script paging through any supported document, here it is! It can also zoom
into pages, and it runs under Python 2 or 3. It requires the extremely handy PySimpleGUI pure Python package.
3.6.4.3 PyQt4, PyQt5, PySide
Again, you can also get along without using Pillow. Qt’s QImage luckily supports native Python pointers, so the
following is the recommended way to create Qt images:
We can also extract all text, images and other information of a page in many different forms, and levels of detail:
text = page.get_text(opt)
Use one of the following strings for opt to obtain different formats [2]:
• “text”: (default) plain text with line breaks. No formatting, no text position details, no images.
• “blocks”: generate a list of text blocks (= paragraphs).
• “words”: generate a list of words (strings not containing spaces).
• “html”: creates a full visual version of the page including any images. This can be displayed with your internet
browser.
• “dict” / “json”: same information level as HTML, but provided as a Python dictionary or resp. JSON string.
See TextPage.extractDICT() for details of its structure.
• “rawdict” / “rawjson”: a super-set of “dict” / “json”. It additionally provides character detail information like
XML. See TextPage.extractRAWDICT() for details of its structure.
• “xhtml”: text information level as the TEXT version but includes images. Can also be displayed by internet
browsers.
• “xml”: contains no images, but full position and font information down to each single text character. Use an
XML module to interpret.
To give you an idea of the output of these alternatives, we have prepared sample extracts. See Appendix 2: Details on
Text Extraction.
You can find out, exactly where on a page a certain text string appears:
areas = page.search_for("mupdf")
This delivers a list of rectangles (see Rect), each of which surrounds one occurrence of the string “mupdf” (case
insensitive). You could use this information to e.g. highlight those areas (PDF only) or create a cross reference of the
document.
Please also have a look at chapter Working together: DisplayList and TextPage and at demo programs demo.py and
demo-lowlevel.py. Among other things, they contain details on how the TextPage, Device and DisplayList classes can
be used for more direct control, e.g. when performance considerations suggest it.
PDFs are the only document type that can be modified using PyMuPDF. Other file types are read-only.
However, you can convert any document (including images) to a PDF and then apply all PyMuPDF features to the
conversion result. Find out more here Document.convert_to_pdf(), and also look at the demo script pdf-
converter.py which can convert any supported document to PDF.
Document.save() always stores a PDF in its current (potentially modified) state on disk.
You can normally choose whether to save to a new file, or just append your modifications to the existing one
(“incremental save”), which is often much faster.
The following describes ways to manipulate PDF documents. This description is by no means complete:
much more can be found in the following chapters.
2 Page.get_text() is a convenience wrapper for several methods of another PyMuPDF class, TextPage. The names of these methods
There are several ways to manipulate the so-called page tree (a structure describing all the pages) of a PDF:
Document.delete_page() and Document.delete_pages() delete pages.
Document.copy_page(), Document.fullcopy_page() and Document.move_page() copy or move
a page to other locations within the same document.
Document.select() shrinks a PDF down to selected pages. Its parameter is a sequence (see footnote 3) of the page
numbers that you want to keep. These integers must all be in range 0 <= i < page_count. When executed, all pages
missing in this list will be deleted. Remaining pages will occur in the sequence and as many times (!) as you specify them.
So you can easily create new PDFs with
• the first or last 10 pages,
• only the odd or only the even pages (for doing double-sided printing),
• pages that do or don’t contain a given text,
• reverse the page sequence, . . .
. . . whatever you can think of.
The saved new document will contain links, annotations and bookmarks that are still valid (i.e. they point either to a
selected page or to some external resource).
Document.insert_page() and Document.new_page() insert new pages.
Pages themselves can moreover be modified by a range of methods (e.g. page rotation, annotation and link mainte-
nance, text and image insertion).
Method Document.insert_pdf() copies pages between different PDF documents. Here is a simple joiner
example (doc1 and doc2 being opened PDFs):
Here is a snippet that splits doc1. It creates a new document of its first and its last 10 pages:
More can be found in the Document chapter. Also have a look at PDFjoiner.py.
PDFs can be used as containers for arbitrary data (executables, other PDFs, text or binary files, etc.) much like ZIP
archives.
3 “Sequences” are Python objects conforming to the sequence protocol. These objects implement a method named __getitem__(). Best known
examples are Python tuples and lists. But array.array, numpy.array and PyMuPDF’s “geometry” objects (Operator Algebra for Geometry Objects)
are sequences, too. Refer to Using Python Sequences as Arguments in PyMuPDF for details.
PyMuPDF fully supports this feature via Document embfile_* methods and attributes. For some detail read Appendix
3: Considerations on Embedded Files, consult the Wiki on embedding files, or the example scripts embedded-copy.py,
embedded-export.py, embedded-import.py, and embedded-list.py.
3.7.4 Saving
As mentioned above, Document.save() will always save the document in its current state.
You can write changes back to the original PDF by specifying option incremental=True. This process is (usually)
extremely fast, since changes are appended to the original file without completely rewriting it.
Document.save() options correspond to options of MuPDF’s command line utility mutool clean, see the following
table.
Note: For an explanation of terms like object, stream, xref consult the Glossary chapter.
For example, mutool clean -ggggz file.pdf yields excellent compression results. It corresponds to doc.save(filename,
garbage=4, deflate=True).
3.8 Closing
It is often desirable to “close” a document to relinquish control of the underlying file to the OS, while your program
continues.
This can be achieved by the Document.close() method. Apart from closing the underlying file, buffer areas
associated with the document will be freed.
Also have a look at PyMuPDF’s Wiki pages. Especially those named in the sidebar under title “Recipes” cover over
15 topics written in “How-To” style.
This document also contains a Collection of Recipes. This chapter has close connection to the aforementioned recipes,
and it will be extended with more content over time.
CHAPTER 4
Collection of Recipes
A collection of recipes in “How-To” format for using PyMuPDF. We aim to extend this section over time. Where
appropriate we will refer to the corresponding Wiki pages, but some duplication may still occur.
4.1 Images
This little script will take a document filename and generate a PNG file from each of its pages.
The document can be any supported type like PDF, XPS, etc.
The script works as a command line tool which expects the filename to be supplied as a parameter. The generated
image files (1 per page) are stored in the directory of the script:
The script directory will now contain PNG image files named page-0.png, page-1.png, etc. Pictures have the dimension
of their pages with width and height rounded to integers, e.g. 595 x 842 pixels for an A4 portrait sized page. They
will have a resolution of 96 dpi in x and y dimension and have no transparency. You can change all that – for how to
do this, read the next sections.
The image of a document page is represented by a Pixmap, and the simplest way to create a pixmap is via method
Page.get_pixmap().
This method has many options to influence the result. The most important among them is the Matrix, which lets you
zoom, rotate, distort or mirror the outcome.
Page.get_pixmap() by default will use the Identity matrix, which does nothing.
In the following, we apply a zoom factor of 2 to each dimension, which will generate an image with a four times better
resolution for us (and also about 4 times the size):
Since version 1.19.2 there is a more direct way to set the resolution: Parameter "dpi" (dots per inch) can be used
in place of "matrix". To create a 300 dpi image of a page specify pix = page.get_pixmap(dpi=300).
Apart from notation brevity, this approach has the additional advantage that the dpi value is saved with the image file
– which does not happen automatically when using the Matrix notation.
You do not always need or want the full image of a page. This is the case e.g. when you display the image in a GUI
and would like to fill the respective window with a zoomed part of the page.
Let’s assume your GUI window has room to display a full document page, but you now want to fill this room with the
bottom right quarter of your page, thus using a four times better resolution.
To achieve this, define a rectangle equal to the area you want to appear in the GUI and call it “clip”. One way of
constructing rectangles in PyMuPDF is by providing two diagonally opposite corners, which is what we are doing
here.
In the above we construct clip by specifying two diagonally opposite points: the middle point mp of the page rectangle,
and its bottom right, rect.br.
Please also read the previous section. This time we want to compute the zoom factor for a clip, such that its image
best fits a given GUI window. This means that the image’s width or height (or both) will equal the window dimension.
For the following code snippet you need to provide the WIDTH and HEIGHT of your GUI’s window that should
receive the page’s clip rectangle.
# WIDTH: width of the GUI window
# HEIGHT: height of the GUI window
# clip: a subrectangle of the document page
# compare width/height ratios of image and window
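The comparison described by these comments can be sketched in plain Python as follows. The helper name `fit_zoom` is hypothetical; it only computes the zoom factor and does not touch PyMuPDF.

```python
def fit_zoom(clip_width, clip_height, WIDTH, HEIGHT):
    """Return the largest zoom factor at which the clip still fits the window."""
    if clip_width / clip_height < WIDTH / HEIGHT:
        # clip is narrower than the window: the window HEIGHT is the limit
        return HEIGHT / clip_height
    # clip is broader than the window: the window WIDTH is the limit
    return WIDTH / clip_width

zoom = fit_zoom(300, 400, 800, 600)   # portrait clip in a landscape window
print(zoom, 300 * zoom, 400 * zoom)
```

The result would then be used as `mat = fitz.Matrix(zoom, zoom)` in `page.get_pixmap(matrix=mat, clip=clip)`.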
For the other way round, now assume you have the zoom factor and need to compute the fitting clip.
In this case we have zoom = HEIGHT/clip.height = WIDTH/clip.width, so we must set
clip.height = HEIGHT/zoom and clip.width = WIDTH/zoom. Choose the top-left point tl of the clip on
the page to compute the right pixmap:
width = WIDTH / zoom
height = HEIGHT / zoom
clip = fitz.Rect(tl, tl.x + width, tl.y + height)
# ensure we still are inside the page
clip &= page.rect
mat = fitz.Matrix(zoom, zoom)
# render the clip area of the page
pix = page.get_pixmap(matrix=mat, clip=clip)
Normally, the pixmap of a page also shows the page’s annotations. Occasionally, this may not be desirable.
To suppress the annotation images on a rendered page, just specify annots=False in Page.get_pixmap().
You can also render annotations separately: they have their own Annot.get_pixmap() method. The resulting
pixmap has the same dimensions as the annotation rectangle.
In contrast to the previous sections, this section deals with extracting images contained in documents, so they can be
displayed as part of one or more pages.
If you want to recreate the original image in file form or as a memory area, you have basically two options:
1. Convert your document to a PDF, and then use one of the PDF-only extraction methods. This snippet will
convert a document to PDF:
2. Use Page.get_text() with the “dict” parameter. This works for all document types. It will extract all text
and images shown on the page, formatted as a Python dictionary. Every image will occur in an image block,
containing meta information and the binary image data. For details of the dictionary’s structure, see TextPage.
The method works equally well for PDF files. This creates a list of all images shown on a page:
>>> d = page.get_text("dict")
>>> blocks = d["blocks"] # the list of block dictionaries
>>> imgblocks = [b for b in blocks if b["type"] == 1]
>>> pprint(imgblocks[0])
{'bbox': (100.0, 135.8769989013672, 300.0, 364.1230163574219),
'bpc': 8,
'colorspace': 3,
'ext': 'jpeg',
'height': 501,
'image': b'\xff\xd8\xff\xe0\x00\x10JFIF\...', # CAUTION: LARGE!
'size': 80518,
'transform': (200.0, 0.0, -0.0, 228.2460174560547, 100.0, 135.8769989013672),
'type': 1,
'width': 439,
'xres': 96,
'yres': 96}
Like any other “object” in a PDF, images are identified by a cross reference number (xref, an integer). If you know
this number, you have two ways to access the image’s data:
1. Create a Pixmap of the image with instruction pix = fitz.Pixmap(doc, xref). This method is very fast (single
digit micro-seconds). The pixmap’s properties (width, height, . . . ) will reflect the ones of the image. In this
case there is no way to tell which image format the embedded original has.
2. Extract the image with img = doc.extract_image(xref). This is a dictionary containing the binary image data as
img[“image”]. A number of meta data are also provided – mostly the same as you would find in the pixmap
of the image. The major difference is string img[“ext”], which specifies the image format: apart from “png”,
strings like “jpeg”, “bmp”, “tiff”, etc. can also occur. Use this string as the file extension if you want to store
to disk. The execution speed of this method should be compared to the combined speed of the statements pix
= fitz.Pixmap(doc, xref);pix.tobytes(). If the embedded image is in PNG format, the speed of Document.
extract_image() is about the same (and the binary image data are identical). Otherwise, this method is
thousands of times faster, and the image data is much smaller.
The question remains: “How do I know those ‘xref’ numbers of images?”. There are two answers to this:
a. “Inspect the page objects:” Loop through the items of Page.get_images(). It is a list of lists, and its items
look like [xref, smask, . . . ], containing the xref of an image. This xref can then be used with one of the
above methods. Use this method for valid (undamaged) documents. Be aware, however, that the same image
may be referenced multiple times (by different pages), so you might want to provide a mechanism avoiding
multiple extracts.
b. “No need to know:” Loop through the list of all xrefs of the document and perform a Document.
extract_image() for each one. If the returned dictionary is empty, then continue – this xref is no image.
Use this method if the PDF is damaged (unusable pages). Note that a PDF often contains “pseudo-images”
(“stencil masks”) with the special purpose of defining the transparency of some other image. You may want to
provide logic to exclude those from extraction. Also have a look at the next section.
For both extraction approaches, there exist ready-to-use general purpose scripts:
extract-imga.py extracts images page by page:
Some images in PDFs are accompanied by image masks. In their simplest form, masks represent alpha (transparency)
bytes stored as separate images. In order to reconstruct the original of an image, which has a mask, it must be
“enriched” with transparency bytes taken from its mask.
Whether an image does have such a mask can be recognized in one of two ways in PyMuPDF:
1. An item of Document.get_page_images() has the general format (xref, smask, ...), where
xref is the image’s xref and smask, if positive, is the xref of a mask.
2. The (dictionary) results of Document.extract_image() have a key “smask”, which also contains any
mask’s xref if positive.
If smask == 0 then the image encountered via xref can be processed as it is.
To recover the original image using PyMuPDF, the following procedure must be executed:
Step (1) creates a pixmap of the basic image. Step (2) does the same with the image mask. Step (3) adds an alpha
channel and fills it with transparency information.
The scripts extract-imga.py and extract-imgb.py above also contain this logic.
4.1.9 How to Make one PDF of all your Pictures (or Files)
We show here three scripts that take a list of (image and other) files and put them all in one PDF.
Method 1: Inserting Images as Pages
The first one converts each image to a PDF page with the same dimensions. The result will be a PDF with one page
per image. It will only work for supported image file formats:
for i, f in enumerate(imglist):
    img = fitz.open(os.path.join(imgdir, f)) # open pic as document
    rect = img[0].rect # pic dimension
    ...
doc.save("all-my-pics.pdf")
This will generate a PDF only marginally larger than the combined pictures’ size. Some numbers on performance:
The above script needed about 1 minute on my machine for 149 pictures with a total size of 514 MB (and about the
same resulting PDF size).
Look here for a more complete source code: it offers a directory selection dialog and skips unsupported files and
non-file entries.
Note: We might have used Page.insert_image() instead of Page.show_pdf_page(), and the result
would have been a similar looking file. However, depending on the image type, it may store images uncompressed.
Therefore, the save option deflate = True must be used to achieve a reasonable file size, which hugely increases the
runtime for large numbers of images. So this alternative cannot be recommended here.
for i, f in enumerate(imglist):
doc.save("all-my-pics-embedded.pdf")
This is by far the fastest method, and it also produces the smallest possible output file size. The above pictures needed
20 seconds on my machine and yielded a PDF size of 510 MB. Look here for a more complete source code: it offers
a directory selection dialog and skips non-file entries.
Method 3: Attaching Files
A third way to achieve this task is attaching files via page annotations; see here for the complete source code.
This has performance similar to the previous script and it also produces a similar file size. It will produce PDF pages
which show a ‘FileAttachment’ icon for each attached file.
Note: Both the embed and the attach methods can be used for arbitrary files – not just images.
Note: We strongly recommend using the awesome package PySimpleGUI to display a progress meter for tasks that
may run for an extended time span. It’s pure Python, uses Tkinter (no additional GUI package) and requires just one
more line of code!
The usual way to create an image from a document page is Page.get_pixmap(). A pixmap represents a raster
image, so you must decide on its quality (i.e. resolution) at creation time. It cannot be changed later.
PyMuPDF also offers a way to create a vector image of a page in SVG format (scalable vector graphics, defined in
XML syntax). SVG images remain precise across zooming levels (of course with the exception of any raster graphic
elements embedded therein).
Instruction svg = page.get_svg_image(matrix=fitz.Identity) delivers a UTF-8 string svg which can be stored with
extension “.svg”.
Just as a feature among others, PyMuPDF’s image conversion is easy. It may avoid using other graphics packages like
PIL/Pillow in many cases.
That said, interfacing with Pillow is almost trivial.
Remarks
1. The input argument of fitz.Pixmap(arg) can be a file or a bytes / io.BytesIO object containing an image.
2. Instead of an output file, you can also create a bytes object via pix.tobytes(“yyy”) and pass this around.
3. As a matter of course, input and output formats must be compatible in terms of colorspace and transparency.
The Pixmap class has batteries included if adjustments are needed.
pix = fitz.Pixmap("myfamily.jpg")
pix.save("myfamily.psd")
Note: Convert JPEG to Tkinter PhotoImage. Any RGB / no-alpha image works exactly the same. Conversion
to one of the Portable Anymap formats (PPM, PGM, etc.) does the trick, because they are supported by all Tkinter
versions:
Note: Convert PNG with alpha to Tkinter PhotoImage. This requires removing the alpha bytes, before we can do
the PPM conversion:
This shows how pixmaps can be used for purely graphical, non-document purposes. The script reads an image file and
creates a new image which consists of 3 * 4 tiles of the original:
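import fitz
src = fitz.Pixmap("img-7edges.png") # create pixmap from a picture
col = 3 # tiles per row
lin = 4 # tiles per column
tar_w = src.width * col # width of target
tar_h = src.height * lin # height of target
# create the target pixmap with the colorspace and alpha of the source
tar_pix = fitz.Pixmap(src.colorspace, (0, 0, tar_w, tar_h), src.alpha)
# fill the target with the tiles
for i in range(col):
    for j in range(lin):
        src.set_origin(src.width * i, src.height * j) # move source to the tile position
        tar_pix.copy(src, src.irect) # copy source into the target at that position
tar_pix.save("tar.png")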
Here is another Pixmap example that creates Sierpinski’s Carpet – a fractal generalizing the Cantor Set to two
dimensions. Given a square carpet, mark its 9 sub-squares (3 times 3) and cut out the one in the center. Treat each of
the remaining eight sub-squares in the same way, and continue ad infinitum. The end result is a set with area zero and
fractal dimension 1.8928. . .
This script creates an approximate image of it as a PNG, by going down to one-pixel granularity. To increase the image
precision, change the value of n (precision):
t0 = time.perf_counter()
ir = (0, 0, d, d) # the pixmap rectangle
return
#==============================================================================
# main program
#==============================================================================
# now start punching holes into the pixmap
punch(0, 0, d)
t1 = time.perf_counter()
pm.save("sierpinski-punch.png")
t2 = time.perf_counter()
print ("%g sec to create / fill the pixmap" % round(t1-t0,3))
print ("%g sec to save the image" % round(t2-t1,3))
This shows how to create a PNG file from a numpy array (several times faster than most other methods):
import numpy as np
import fitz
#==============================================================================
# create a fun-colored width * height PNG with fitz and numpy
#==============================================================================
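height = 150
width = 100
bild = np.ndarray((height, width, 3), dtype=np.uint8)
for i in range(height):
    for j in range(width):
        # one pixel (some fun coloring)
        bild[i, j] = [(i+j)%256, i%256, j%256]
# hand the raw RGB bytes of the array to a pixmap and save it as PNG
samples = bytearray(bild.tobytes())
pix = fitz.Pixmap(fitz.csRGB, width, height, samples, False)
pix.save("test.png") # file name chosen for illustration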
There are two methods to add images to a PDF page: Page.insert_image() and Page.show_pdf_page().
Both methods have things in common, but there also exist differences.
Basic code pattern for Page.insert_image(). Exactly one of the parameters filename / stream / pixmap must
be given, if not re-inserting an existing image:
page.insert_image(
    rect, # where to place the image (rect-like)
    filename=None, # image in a file
    stream=None, # image in memory (bytes)
    pixmap=None, # image from pixmap
    mask=None, # specify alpha channel separately
    rotate=0, # rotate (int, multiple of 90)
    xref=0, # re-use existing image
    oc=0, # control visibility via OCG / OCMD
    keep_proportion=True, # keep aspect ratio
    overlay=True, # put in foreground
)
Basic code pattern for Page.show_pdf_page(). Source and target PDF must be different Document objects (but
may be opened from the same file):
page.show_pdf_page(
    rect, # where to place the image (rect-like)
    src, # source PDF
    pno=0, # page number in source PDF
    clip=None, # only display this area (rect-like)
    rotate=0, # rotate (float, any value)
    oc=0, # control visibility via OCG / OCMD
    keep_proportion=True, # keep aspect ratio
    overlay=True, # put in foreground
)
For the following discussion, please also consult the previous section.
If the pixmap parameter is used in Page.insert_image(), the image is always stored in uncompressed PNG
format. This is independent of how the pixmap was originally created.
For filename and stream parameters, the original image format, quality and size are preserved (JPEG, BMP,
JPEG2000, etc.). However: the method takes the following actions:
1. Create an internal pixmap to see if the image is transparent.
2. If not transparent, discard pixmap and insert image in original format.
3. If transparent, create a new internal image and an image mask containing transparency information – both in
pixmap format – and store both pixmap images. This will be uncompressed PNG format again.
Here is what you can do to take closer control:
1. Often you already know in advance whether an image is transparent. For example, if you have a PIL image, check
the last letter of img.mode. If you see “RGBA” you have an RGB image with an alpha channel.
2. If your image is not transparent, include alpha=0 in your method arguments. The method will then skip
internal pixmap creation and store the image as is.
3. If your image has alpha, you can use the following snippet to create two sub-images: (1) the base-image, (2) the
mask image (alpha values). Then insert them combined using the stream and mask arguments. Again, the
method will omit any alpha-checking or conversion and store image and mask as is:
4.2 Text
This script will take a document filename and generate a text file from all of its text.
The document can be any supported type like PDF, XPS, etc.
The script works as a command line tool which expects the document filename supplied as a parameter. It generates
one text file named “filename.txt” in the script directory. Text of pages is separated by a form feed character:
The output will be plain text as it is coded in the document. No effort is made to prettify in any way. Specifically for
PDF, this may mean output not in usual reading order, unexpected line breaks and so forth.
You have many options to cure this – see chapter Appendix 2: Details on Text Extraction. Among them are:
1. Extract text in HTML format and store it as a HTML document, so it can be viewed in any browser.
2. Extract text as a list of text blocks via Page.get_text(“blocks”). Each item of this list contains position informa-
tion for its text, which can be used to establish a convenient reading order.
3. Extract a list of single words via Page.get_text(“words”). Its items are words with position information. Use it
to determine text contained in a given rectangle – see next section.
See the following two sections for examples and further explanations.
There is now (v1.18.0) more than one way to achieve this. We therefore have created a folder in the PyMuPDF-Utilities
repository specifically dealing with this topic.
One of the common issues with PDF text extraction is that text may not appear in any particular reading order.
Responsible for this effect is the PDF creator (software or a human). For example, page headers may have been
inserted in a separate step – after the document had been produced. In such a case, the header text will appear at the
end of a page text extraction (although it will be correctly shown by PDF viewer software). For example, the following
snippet will add some header and footer lines to an existing PDF:
doc = fitz.open("some.pdf")
header = "Header" # text in header
footer = "Page %i of %i" # text in footer
for page in doc:
    page.insert_text((50, 50), header) # insert header
    page.insert_text( # insert footer 50 points above page bottom
        (50, page.rect.height - 50),
        footer % (page.number + 1, len(doc)),
    )
The text sequence extracted from a page modified in this way will look like this:
1. original text
2. header line
3. footer line
PyMuPDF has several means to re-establish some reading sequence or even to re-generate a layout close to the original:
1. Use sort parameter of Page.get_text(). It will sort the output from top-left to bottom-right (ignored for
XHTML, HTML and XML output).
2. Use the fitz module in CLI: python -m fitz gettext ..., which produces a text file where text has
been re-arranged in layout-preserving mode. Many options are available to control the output.
You can also use the above mentioned script with your modifications.
If you see a table in a document, you are not normally looking at something like an embedded Excel or other identifi-
able object. It usually is just text, formatted to appear as appropriate.
Extracting tabular data from such a page area therefore means that you must find a way to (1) graphically indicate
table and column borders, and (2) then extract text based on this information.
The wxPython GUI script wxTableExtract.py strives to do exactly that. You may want to have a look at it and adjust
it to your liking.
There is a standard search function to search for arbitrary text on a page: Page.search_for(). It returns a list
of Rect objects which surround a found occurrence. These rectangles can for example be used to automatically insert
annotations which visibly mark the found text.
This method has advantages and drawbacks. Pros are
• The search string can contain blanks and wrap across lines
• Upper and lower case characters are treated as equal
• Word hyphenation at line ends is detected and resolved
• The return value may also be a list of Quad objects to precisely locate text that is not parallel to either axis – using
Quad output is also recommended when page rotation is not zero.
But you also have other options:
import sys
import fitz
if new_doc:
    doc.save("marked-" + doc.name)
This script uses Page.get_text("words") to look for a string, handed in via cli parameter. This method
separates a page’s text into “words” using spaces and line breaks as delimiters. Therefore the words in this list do not
contain these characters. Further remarks:
• If found, the complete word containing the string is marked (underlined) – not only the search string.
• The search string may not contain spaces or other white space.
• As shown here, upper / lower cases are respected. But this can be changed by using the string method lower()
(or even regular expressions) in function mark_word.
The previous section already shows an example for marking non-horizontal text, that was detected by text searching.
But text extraction with the “dict” / “rawdict” options of Page.get_text() may also return text with a non-zero
angle to the x-axis. This is indicated by the value of the line dictionary’s "dir" key: it is the tuple (cosine,
sine) for that angle. If line["dir"] != (1, 0), then the text of all its spans is rotated by (the same) angle !=
0.
The “bboxes” returned by the method however are rectangles only – not quads. So, to mark span text correctly, its
quad must be recovered from the data contained in the line and span dictionary. Do this with the following utility
function (new in v1.18.9):
If you want to mark the complete line or a subset of its spans in one go, use the following snippet (works for v1.18.10
or later):
The spans argument above may specify any sub-list of line["spans"]. In the example above, the second to
second-to-last span are marked. If omitted, the complete line is taken.
To analyze the characteristics of text in a PDF use this elementary script as a starting point:
import fitz
def flags_decomposer(flags):
    """Make font flags human readable."""
    l = []
    if flags & 2 ** 0:
        l.append("superscript")
    if flags & 2 ** 1:
        l.append("italic")
    if flags & 2 ** 2:
        l.append("serifed")
    else:
        l.append("sans")
    if flags & 2 ** 3:
        l.append("monospaced")
    else:
        l.append("proportional")
    if flags & 2 ** 4:
        l.append("bold")
    return ", ".join(l)
doc = fitz.open("text-tester.pdf")
page = doc[0]
PyMuPDF provides ways to insert text on new or existing PDF pages with the following features:
• choose the font, including built-in fonts and fonts that are available as files
• choose text characteristics like bold, italic, font size, font color, etc.
• position the text in multiple ways:
– either as simple line-oriented output starting at a certain point,
– or fitting text in a box provided as a rectangle, in which case text alignment choices are also available,
– choose whether text should be put in foreground (overlay existing content),
– all text can be arbitrarily “morphed”, i.e. its appearance can be changed via a Matrix, to achieve effects
like scaling, shearing or mirroring,
– independently from morphing and in addition to that, text can be rotated by integer multiples of 90 degrees.
All of the above is provided by three basic Page, resp. Shape methods:
• Page.insert_font() – install a font for the page for later reference. The result is reflected in the output
of Document.get_page_fonts(). The font can be:
– provided as a file,
– via Font (then use Font.buffer)
– already present somewhere in this or another PDF, or
– be a built-in font.
• Page.insert_text() – write some lines of text. Internally, this uses Shape.insert_text().
• Page.insert_textbox() – fit text in a given rectangle. Here you can choose text alignment features (left,
right, centered, justified) and you keep control as to whether text actually fits. Internally, this uses Shape.
insert_textbox().
Note: Both text insertion methods automatically install the font as necessary.
doc.save("text.pdf")
With this method, only the number of lines will be controlled to not go beyond page height. Surplus lines will not be
written and the number of actual lines will be returned. The calculation uses a line height calculated from the fontsize
and 36 points (0.5 inches) as bottom margin.
Line width is ignored. The surplus part of a line will simply be invisible.
However, for built-in fonts there are ways to calculate the line width beforehand - see get_text_length().
Here is another example. It inserts four text strings using the four different rotation options, and thereby shows how
the text insertion point must be chosen to achieve the desired result:
import fitz
doc = fitz.open()
page = doc.new_page()
# the text strings, each having 3 lines
text1 = "rotate=0\nLine 2\nLine 3"
text2 = "rotate=90\nLine 2\nLine 3"
text3 = "rotate=-90\nLine 2\nLine 3"
text4 = "rotate=180\nLine 2\nLine 3"
red = (1, 0, 0) # the color for the red dots
# the insertion points, each with a 25 pix distance from the corners
p1 = fitz.Point(25, 25)
p2 = fitz.Point(page.rect.width - 25, 25)
p3 = fitz.Point(25, page.rect.height - 25)
p4 = fitz.Point(page.rect.width - 25, page.rect.height - 25)
# create a Shape to draw on
shape = page.new_shape()
This script fills 4 different rectangles with text, each time choosing a different rotation value:
import fitz
doc = fitz.open(...) # new or existing PDF
page = doc.new_page() # new page, or choose doc[n]
r1 = fitz.Rect(50,100,100,150) # a 50x50 rectangle
disp = fitz.Rect(55, 0, 55, 0) # add this to get more rects
r2 = r1 + disp # 2nd rect
r3 = r1 + disp * 2 # 3rd rect
r4 = r1 + disp * 3 # 4th rect
t1 = "text with rotate = 0." # the texts we will put in
t2 = "text with rotate = 90."
t3 = "text with rotate = -90."
t4 = "text with rotate = 180."
red = (1,0,0) # some colors
Several default values were used above: font “Helvetica”, font size 11 and text alignment “left”. The result will look
like this:
Since v1.14, MuPDF allows Greek and Russian encoding variants for the Base14_Fonts. In PyMuPDF this is
supported via an additional encoding argument. Effectively, this is relevant for Helvetica, Times-Roman and Courier
(and their bold / italic forms) and characters outside the ASCII code range only. Elsewhere, the argument is ignored.
Here is how to request Russian encoding with the standard font Helvetica:
The valid encoding values are TEXT_ENCODING_LATIN (0), TEXT_ENCODING_GREEK (1), and
TEXT_ENCODING_CYRILLIC (2, Russian) with Latin being the default. Encoding can be specified by all relevant
font and text insertion methods.
By the above statement, the fontname helv is automatically connected to the Russian font variant of Helvetica. Any
subsequent text insertion with this fontname will use the Russian Helvetica encoding.
If you change the fontname just slightly, you can also achieve an encoding “mixture” for the same base font on the
same page:
import fitz
doc=fitz.open()
page = doc.new_page()
shape = page.new_shape()
t="Sômé tèxt wìth nöñ-Lâtîn characterß."
The result:
The snippet above indeed leads to three different copies of the Helvetica font in the PDF. Each copy is uniquely
identified (and referenceable) by using the correct upper-lower case spelling of the reserved word “helv”:
4.3 Annotations
In PyMuPDF, new annotations can be added via Page methods. Once an annotation exists, it can be modified to a large
extent using methods of the Annot class.
In contrast to many other tools, the initial insertion of an annotation sets only a minimal number of properties. It is
left to the programmer to set attributes such as author, creation date or subject.
As an overview for these capabilities, look at the following script that fills a PDF page with most of the available
annotations. Look in the next sections for more special situations:
Dependencies
------------
PyMuPDF v1.17.0
-------------------------------------------------------------------------------
"""
from __future__ import print_function
import gc
import sys
import fitz
print(fitz.__doc__)
# compare numerically: as strings, "9" would sort after "17"
if tuple(map(int, fitz.VersionBind.split("."))) < (1, 17, 0):
    sys.exit("PyMuPDF v1.17.0+ is needed.")
gc.set_debug(gc.DEBUG_UNCOLLECTABLE)
def print_descr(annot):
    """Print a short description to the right of each annot rect."""
    annot.parent.insert_text(
doc = fitz.open()
page = doc.new_page()
page.set_rotation(0)
annot = page.add_caret_annot(r.tl)
print_descr(annot)
r = r + displ
annot = page.add_freetext_annot(
    r,
    t1,
    fontsize=10,
    rotate=90,
    text_color=blue,
    fill_color=gold,
    align=fitz.TEXT_ALIGN_CENTER,
)
annot.set_border(width=0.3, dashes=[2])
annot.update(text_color=blue, fill_color=gold)
print_descr(annot)
r = annot.rect + displ
annot = page.add_text_annot(r.tl, t1)
print_descr(annot)
pos = annot.rect.bl
page.insert_text(pos, strikeout, morph=(pos, fitz.Matrix(-15)))
rl = page.search_for(strikeout, quads=True)
annot = page.add_strikeout_annot(rl[0])
print_descr(annot)
pos = annot.rect.bl
page.insert_text(pos, squiggled, morph=(pos, fitz.Matrix(-20)))
rl = page.search_for(squiggled, quads=True)
pos = annot.rect.bl
r = fitz.Rect(pos, pos.x + 75, pos.y + 35) + (0, 20, 0, 20)
annot = page.add_polyline_annot([r.bl, r.tr, r.br, r.tl]) # 'Polyline'
annot.set_border(width=0.3, dashes=[2])
annot.set_colors(stroke=blue, fill=green)
annot.set_line_ends(fitz.PDF_ANNOT_LE_CLOSED_ARROW, fitz.PDF_ANNOT_LE_R_CLOSED_ARROW)
annot.update(fill_color=(1, 1, 0))
print_descr(annot)
r += displ
annot = page.add_polygon_annot([r.bl, r.tr, r.br, r.tl]) # 'Polygon'
annot.set_border(width=0.3, dashes=[2])
annot.set_colors(stroke=blue, fill=gold)
annot.set_line_ends(fitz.PDF_ANNOT_LE_DIAMOND, fitz.PDF_ANNOT_LE_CIRCLE)
annot.update()
print_descr(annot)
r += displ
annot = page.add_line_annot(r.tr, r.bl) # 'Line'
annot.set_border(width=0.3, dashes=[2])
annot.set_colors(stroke=blue, fill=gold)
annot.set_line_ends(fitz.PDF_ANNOT_LE_DIAMOND, fitz.PDF_ANNOT_LE_CIRCLE)
annot.update()
print_descr(annot)
r += displ
annot = page.add_rect_annot(r) # 'Square'
annot.set_border(width=1, dashes=[1, 2])
annot.set_colors(stroke=blue, fill=gold)
annot.update(opacity=0.5)
print_descr(annot)
r += displ
annot = page.add_circle_annot(r) # 'Circle'
annot.set_border(width=0.3, dashes=[2])
annot.set_colors(stroke=blue, fill=gold)
annot.update()
print_descr(annot)
r += displ
annot = page.add_file_annot(
    r.tl, b"just anything for testing", "testdata.txt" # 'FileAttachment'
)
print_descr(annot) # annot.rect
r += displ
annot = page.add_stamp_annot(r, stamp=10) # 'Stamp'
annot.set_colors(stroke=green)
annot.update()
print_descr(annot)
# some colors
blue = (0,0,1)
green = (0,1,0)
red = (1,0,0)
gold = (1,1,0)
Since MuPDF v1.16, ‘FreeText’ annotations no longer support bold or italic versions of the Times-Roman, Helvetica
or Courier fonts.
A big thank you to our user @kurokawaikki, who contributed the following script to circumvent this restriction.
"""
Problem: Since MuPDF v1.16 a 'Freetext' annotation font is restricted to the
"normal" versions (no bold, no italics) of Times-Roman, Helvetica, Courier.
It is impossible to use PyMuPDF to modify this.
If we have 'FreeText' annotations created with PyMuPDF, we can make use of this
JavaScript feature to modify the font - thus circumventing the above restriction.
Note / Caution:
---------------
The JavaScript will **only** work if the file is opened with Adobe Acrobat reader!
When using other PDF viewers, the reaction is unforeseeable.
"""
import sys
import fitz
# ------------------------------------------------
# make a push button for invoking the JavaScript
# ------------------------------------------------
# make it a 'PushButton'
widget.field_type = fitz.PDF_WIDGET_TYPE_BUTTON
widget.field_flags = fitz.PDF_BTN_FIELD_IS_PUSHBUTTON
Ink annotations are used to contain freehand scribbling. A typical example may be an image of your signature
consisting of first name and last name. Technically, an ink annotation is implemented as a list of lists of points. Each
point list is regarded as a continuous line connecting the points. Different point lists represent independent line
segments of the annotation.
The following script creates an ink annotation with two mathematical curves (sine and cosine function graphs) as line
segments:
import math
import fitz
#------------------------------------------------------------------------------
# preliminary stuff: create function value lists for sine and cosine
#------------------------------------------------------------------------------
w360 = math.pi * 2 # go through full circle
deg = w360 / 360 # 1 degree as radians
rect = fitz.Rect(100,200, 300, 300) # use this rectangle
first_x = rect.x0 # x starts from left
first_y = rect.y0 + rect.height / 2. # rect middle means y = 0
x_step = rect.width / 360 # rect width means 360 degrees
y_scale = rect.height / 2. # rect height means 2
sin_points = [] # sine values go here
cos_points = [] # cosine values go here
for x in range(362): # now fill in the values
    x_coord = x * x_step + first_x # current x coordinate
    y = -math.sin(x * deg) # sine
#------------------------------------------------------------------------------
# create the document with one page
#------------------------------------------------------------------------------
doc = fitz.open() # make new PDF
page = doc.new_page() # give it a page
#------------------------------------------------------------------------------
# add the Ink annotation, consisting of 2 curve segments
#------------------------------------------------------------------------------
annot = page.add_ink_annot((sin_points, cos_points))
# let it look a little nicer
annot.set_border(width=0.3, dashes=[1,]) # line thickness, some dashing
annot.set_colors(stroke=(0,0,1)) # make the lines blue
annot.update() # update the appearance
doc.save("a-inktest.pdf")
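The "list of lists of points" structure can be built and inspected without PyMuPDF. The following self-contained sketch uses the same rectangle values as the script, but represents the rectangle and the points as plain tuples. Variable names such as y_mid and ink_list are choices of this sketch, not names from the original script.

```python
import math

# rectangle given as (x0, y0, x1, y1)
x0, y0, x1, y1 = 100, 200, 300, 300
width, height = x1 - x0, y1 - y0

deg = math.pi / 180      # one degree in radians
x_step = width / 360     # horizontal distance per degree
y_mid = y0 + height / 2  # y coordinate representing function value 0
y_scale = height / 2     # function values range from -1 to +1

sin_points, cos_points = [], []
for d in range(361):  # 0 to 360 degrees inclusive
    x = x0 + d * x_step
    # y grows downwards in PyMuPDF's coordinate system, hence the subtraction
    sin_points.append((x, y_mid - math.sin(d * deg) * y_scale))
    cos_points.append((x, y_mid - math.cos(d * deg) * y_scale))

ink_list = [sin_points, cos_points]  # a list of lists of points
print(len(ink_list), len(ink_list[0]), ink_list[0][0], ink_list[1][0])
```

Each inner list becomes one independent stroke of the annotation.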
PDF files support elementary drawing operations as part of their syntax. This includes basic geometrical objects like
lines, curves, circles, rectangles including specifying colors.
The syntax for such operations is defined in “A Operator Summary” on page 643 of the Adobe PDF References.
Specifying these operators for a PDF page happens in its contents objects.
PyMuPDF implements a large part of the available features via its Shape class, which is comparable to notions like
“canvas” in other packages (e.g. reportlab).
A shape is always created as a child of a page, usually with an instruction like shape = page.new_shape(). The
class defines numerous methods that perform drawing operations on the page’s area. For example, last_point =
shape.draw_rect(rect) draws a rectangle along the borders of a suitably defined rect = fitz.Rect(...).
The returned last_point is always the Point where the drawing operation ended (“last point”). Every such elementary
drawing requires a subsequent Shape.finish() to “close” it, but multiple drawings may share one
common finish() call.
In fact, Shape.finish() defines a group of preceding draw operations to form one – potentially rather complex –
graphics object. PyMuPDF provides several predefined graphics in shapes_and_symbols.py which demonstrate how
this works.
If you import this script, you can also directly use its graphics as in the following example:
@author: Jorj
@license: GNU AFFERO GPL V3
This also demonstrates an example usage: how these symbols could be used
as bullet-point symbols in some text.
"""
import fitz
import shapes_and_symbols as sas
for i, r in enumerate(rlist):
    tlist[i][0](shape, rlist[i]) # execute symbol creation
    shape.insert_text(rlist[i].br + p, # insert description text
                      tlist[i][1], fontsize=r.height/1.2)
import os
scriptdir = os.path.dirname(__file__)
doc.save(os.path.join(scriptdir, "symbol-list.pdf")) # save the PDF
(New in v1.18.0)
The drawing commands issued by a page can be extracted. Interestingly, this is possible for all supported document
types – not just PDF: so you can use it for XPS, EPUB and others as well.
The method Page.get_drawings() accesses draw commands and converts them into a list of Python dictionaries.
Each dictionary, called a “path”, represents a separate drawing. It may be as simple as a single line, or a
complex combination of lines and curves representing one of the shapes of the previous section.
The path dictionary has been designed such that it can easily be used by the Shape class and its methods. Here is an
example for a page with one path, that draws a red-bordered yellow circle inside rectangle Rect(100, 100, 200, 200):
>>> pprint(page.get_drawings())
[{'closePath': True,
'color': [1.0, 0.0, 0.0],
'dashes': '[] 0',
Note: You need (at least) 4 Bézier curves (of 3rd order) to draw a circle with acceptable precision. See the Wikipedia
article on Bézier curves (https://en.wikipedia.org/wiki/B%C3%A9zier_curve) for some background.
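To get a feeling for "acceptable precision", the following sketch measures how far one cubic Bézier curve deviates from a true quarter circle. It assumes the commonly used control point distance 4/3 * (sqrt(2) - 1) for a quarter circle. This constant is a standard choice and is not taken from PyMuPDF's source code.

```python
import math

KAPPA = 4 / 3 * (math.sqrt(2) - 1)  # control point distance, about 0.5523


def bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bézier curve at parameter t (0 <= t <= 1)."""
    u = 1 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return x, y


# first quadrant of the unit circle, from (1, 0) to (0, 1)
p0, p1, p2, p3 = (1, 0), (1, KAPPA), (KAPPA, 1), (0, 1)

# largest difference between the curve's distance from the center and radius 1
max_err = max(
    abs(math.hypot(*bezier(p0, p1, p2, p3, i / 1000)) - 1) for i in range(1001)
)
print(f"kappa = {KAPPA:.4f}, max radial error = {max_err:.6f}")
```

With four such curves, one per quadrant, the deviation stays well below a tenth of a percent of the radius.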
The following is a code snippet which extracts the drawings of a page and re-draws them on a new page:
import fitz
doc = fitz.open("some.file")
page = doc[0]
paths = page.get_drawings() # extract existing drawings
# this is a list of "paths", which can directly be drawn again using Shape
# -------------------------------------------------------------------------
#
# define some output page with the same dimensions
outpdf = fitz.open()
outpage = outpdf.new_page(width=page.rect.width, height=page.rect.height)
shape = outpage.new_shape() # make a drawing canvas for the output page
# --------------------------------------
# loop through the paths and draw them
# --------------------------------------
for path in paths:
    # ------------------------------------
    # draw each entry of the 'items' list
    # ------------------------------------
    for item in path["items"]: # these are the draw commands
        if item[0] == "l": # line
            shape.draw_line(item[1], item[2])
As can be seen, there is a high level of congruence with the Shape class, with one exception: for technical reasons,
lineCap is a tuple of 3 numbers here, whereas it is an integer in Shape (and in PDF). So we simply take the maximum
value of that tuple.
Here is a comparison between input and output of an example page, created by the previous script:
Note: The reconstruction of graphics as shown here is not perfect. The following aspects will not be reproduced as
of this version:
• Page definitions can be complex and include instructions for not showing / hiding certain areas to keep them
invisible. Things like this are ignored by Page.get_drawings() - it will always return all paths.
Note: You can use the path list to make your own lists of e.g. all lines or all rectangles on the page, subselect them
by criteria like color or position on the page etc.
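Such sub-selections are ordinary list filtering. The sketch below works on a hand-made sample that imitates a few keys of the path dictionaries ("color", "fill", "rect", "items"). The sample data and the helper is_line are hypothetical, and plain tuples stand in for the Point and Rect objects that PyMuPDF returns.

```python
# hypothetical sample, shaped like a reduced Page.get_drawings() result
paths = [
    {"color": [1.0, 0.0, 0.0], "fill": None, "width": 1.0,
     "rect": (10, 10, 110, 10), "items": [("l", (10, 10), (110, 10))]},
    {"color": [0.0, 0.0, 1.0], "fill": [1.0, 1.0, 0.0], "width": 0.5,
     "rect": (50, 50, 150, 120), "items": [("re", (50, 50, 150, 120))]},
    {"color": [1.0, 0.0, 0.0], "fill": None, "width": 2.0,
     "rect": (10, 300, 10, 400), "items": [("l", (10, 300), (10, 400))]},
]


def is_line(path):
    """True if the path consists of exactly one straight line."""
    return len(path["items"]) == 1 and path["items"][0][0] == "l"


red = (1.0, 0.0, 0.0)
red_lines = [p for p in paths if is_line(p) and tuple(p["color"]) == red]
rectangles = [p for p in paths if any(item[0] == "re" for item in p["items"])]
upper_part = [p for p in paths if p["rect"][3] <= 200]  # bottom edge at y <= 200

print(len(red_lines), len(rectangles), len(upper_part))
```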
4.6 Multiprocessing
MuPDF has no integrated support for threading; its developers describe it as “threading-agnostic”. While there are
tricky ways to use threading with MuPDF nonetheless, the baseline consequence for PyMuPDF is:
No Python threading support.
Using PyMuPDF in a Python threading environment will lead to blocking effects for the main thread.
However, there exists the option to use Python’s multiprocessing module in a variety of ways.
If you are looking to speed up page-oriented processing for a large document, use this script as a starting point. It
should be at least twice as fast as the corresponding sequential processing.
"""
Demonstrate the use of multiprocessing with PyMuPDF.
def render_page(vector):
    """Render a page range of a document.

    Notes:
        The PyMuPDF document cannot be part of the argument, because that
        cannot be pickled. So we are being passed in just its filename.
# pages per segment: make sure that cpu * seg_size >= num_pages!
seg_size = int(num_pages / cpu + 1)
seg_from = idx * seg_size # our first page number
seg_to = min(seg_from + seg_size, num_pages) # last page number
if __name__ == "__main__":
    t0 = mytime() # start a timer
    filename = sys.argv[1]
    mat = fitz.Matrix(0.2, 0.2) # the rendering matrix: scale down to 20%
    cpu = cpu_count()
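The segment arithmetic of the script can be checked in isolation. The sketch below wraps the same seg_size, seg_from and seg_to formulas in a hypothetical helper named segments. Note that seg_to acts as an exclusive upper bound, and that trailing segments can be empty when the page count divides unevenly.

```python
def segments(num_pages, cpu):
    """Split page numbers 0 .. num_pages-1 into at most `cpu` ranges."""
    seg_size = int(num_pages / cpu + 1)  # same formula as in the script
    result = []
    for idx in range(cpu):
        seg_from = idx * seg_size                     # first page of the segment
        seg_to = min(seg_from + seg_size, num_pages)  # exclusive upper bound
        if seg_from < seg_to:                         # skip empty trailing segments
            result.append((seg_from, seg_to))
    return result


segs = segments(num_pages=10, cpu=4)
print(segs)
print(segments(num_pages=8, cpu=4))
```

Each (seg_from, seg_to) pair would be handed to one worker process, which then renders range(seg_from, seg_to).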
Here is a more complex example involving inter-process communication between a main process (showing a GUI)
and a child process doing PyMuPDF access to a document.
"""
Created on 2019-05-01
overkill for most people who might have one or the other, why both?
'''
def module_exists(module_name):
    return module_name in (name for loader, name, ispkg in iter_modules())

if module_exists("PyQt6"):
    # PyQt6
    from PyQt6 import QtGui, QtWidgets, QtCore
    from PyQt6.QtCore import pyqtSignal as Signal, pyqtSlot as Slot
    wrapper = "PyQt6"
elif module_exists("PySide6"):
    # PySide6
    from PySide6 import QtGui, QtWidgets, QtCore
    from PySide6.QtCore import Signal, Slot
    wrapper = "PySide6"
class DocForm(QtWidgets.QWidget):
    def __init__(self):
        super().__init__()
        self.process = None
        self.queNum = mp.Queue()
        self.queDoc = mp.Queue()
        self.page_count = 0
        self.curPageNum = 0
        self.lastDir = ""
        self.timerSend = QtCore.QTimer(self)
        self.timerSend.timeout.connect(self.onTimerSendPageNum)
        self.timerGet = QtCore.QTimer(self)
    def initUI(self):
        vbox = QtWidgets.QVBoxLayout()
        self.setLayout(vbox)
        hbox = QtWidgets.QHBoxLayout()
        self.btnOpen = QtWidgets.QPushButton("OpenDocument", self)
        self.btnOpen.clicked.connect(self.openDoc)
        hbox.addWidget(self.btnOpen)
        vbox.addLayout(hbox)
        )
        self.labelImg.setSizePolicy(sizePolicy)
        vbox.addWidget(self.labelImg)
    def openDoc(self):
        path, _ = QtWidgets.QFileDialog.getOpenFileName(
            self,
            "Open Document",
            self.lastDir,
            "All Supported Files (*.pdf;*.epub;*.xps;*.oxps;*.cbz;*.fb2);;PDF Files (*.pdf);;EPUB Files (*.epub);;XPS Files (*.xps);;OpenXPS Files (*.oxps);;CBZ Files
            #options=QtWidgets.QFileDialog.Options(),
        )
        if path:
            self.lastDir, self.file = os.path.split(path)
            if self.process:
                self.queNum.put(-1) # use -1 to notify the process to exit
            self.timerSend.stop()
            self.curPageNum = 0
            self.page_count = 0
    def playDoc(self):
        self.timerSend.start(500)

    def stopPlay(self):
        self.timerSend.stop()

    def onTimerSendPageNum(self):
        if self.curPageNum < self.page_count - 1:
            self.queNum.put(self.curPageNum + 1)
        else:
            self.timerSend.stop()

    def onTimerGetPage(self):
        try:
            ret = self.queDoc.get(False)
            if isinstance(ret, int):
                self.timerWaiting.stop()
                self.page_count = ret
                self.label.setText("{}/{}".format(self.curPageNum + 1, self.page_count))
                fmt = (
                    QtGui.QImage.Format.Format_RGBA8888
                    if alpha
                    else QtGui.QImage.Format.Format_RGB888
                )
                qimg = QtGui.QImage(samples, width, height, stride, fmt)
                self.labelImg.setPixmap(QtGui.QPixmap.fromImage(qimg))
        except queue.Empty as ex:
            pass

    def onTimerWaiting(self):
        self.labelImg.setText(
            'Loading "{}", {:.2f}s'.format(
                self.file, time.perf_counter() - self.startTime
            )
        )
if __name__ == "__main__":
    app = QtWidgets.QApplication(sys.argv)
    form = DocForm()
    sys.exit(app.exec())
4.7 General
If you have a document with a wrong file extension for its type, you can still correctly open it.
Assume that “some.file” is actually an XPS. Open it like so:
Note: MuPDF itself does not try to determine the file type from the file contents. You are responsible for supplying
the filetype info in some way, either implicitly via the file extension, or explicitly as shown. There are pure Python
packages like filetype that help you do this. Also consult the Document chapter for a full description.
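To illustrate what such a content-based check does, here is a deliberately rough sketch that looks at the first bytes of a file. The function guess_filetype is hypothetical and is neither part of PyMuPDF nor of the filetype package. It relies only on two well-known signatures: PDF files start with "%PDF-", and ZIP-based containers (which include XPS, EPUB and CBZ) start with "PK\x03\x04".

```python
def guess_filetype(data):
    """Guess a document type from the first bytes of a file (rough heuristic)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # ZIP container: XPS, EPUB and CBZ all look the same at this point
        if b"application/epub+zip" in data[:100]:
            return "epub"
        return "zip-based"
    return None  # unknown: fall back to the file extension


print(guess_filetype(b"%PDF-1.7\n"))
print(guess_filetype(b"PK\x03\x04" + b"\x00" * 26 + b"mimetypeapplication/epub+zip"))
print(guess_filetype(b"PK\x03\x04" + b"\x00" * 26 + b"page001.jpg"))
print(guess_filetype(b"hello"))
```

Telling XPS apart from CBZ would require looking inside the ZIP container, which is where a dedicated package is more reliable than a header check.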
PDF supports incorporating arbitrary data. This can be done in one of two ways: “embedding” or “attaching”.
PyMuPDF supports both options.
1. Attached Files: data are attached to a page by way of a FileAttachment annotation with this statement: annot =
page.add_file_annot(pos, ...), for details see Page.add_file_annot(). The first parameter “pos” is the
Point where a “PushPin” icon should be placed on the page.
2. Embedded Files: data are embedded on the document level via method Document.embfile_add().
The basic differences between these options are (1) you need edit permission to embed a file, but only annotation
permission to attach, (2) like all annotations, attachments are visible on a page, embedded files are not.
There exist several example scripts: embedded-list.py, new-annots.py.
Also look at the sections above and at chapter Appendix 3: Considerations on Embedded Files.
With PyMuPDF you have all options to copy, move, delete or re-arrange the pages of a PDF. Intuitive methods exist
that allow you to do this on a page-by-page level, like the Document.copy_page() method.
Alternatively, you can prepare a completely new page layout in the form of a Python sequence that contains the page
numbers you want, in the order you want, and as many times as you want each page. The following illustrates what can
be done with Document.select():
doc.select([1, 1, 1, 5, 4, 9, 9, 9, 0, 2, 2, 2])
Now let’s prepare a PDF for double-sided printing (on a printer not directly supporting this):
The number of pages is given by len(doc) (equal to doc.page_count). The following lists represent the even and the
odd page numbers, respectively:
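A minimal sketch of the two lists, using a hypothetical page count of 7 so that it runs without a document (with PyMuPDF the value would be doc.page_count):

```python
page_count = 7  # hypothetical; with PyMuPDF this would be doc.page_count

p_even = [p for p in range(page_count) if p % 2 == 0]
p_odd = [p for p in range(page_count) if p % 2 == 1]

print(p_even)
print(p_odd)
```

Page numbers are 0-based, so p_even holds the first, third, fifth page and so on. Either list can be passed to Document.select().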
This snippet creates the respective sub documents which can then be used to print the document:
This snippet duplicates the PDF with itself so that it will contain the pages 0, 1, ..., n, 0, 1, ..., n (extremely fast and
without noticeably increasing the file size!):
It is easy to join PDFs with method Document.insert_pdf(). Given open PDF documents, you can copy page
ranges from one to the other. You can select the point where the copied pages should be placed, you can reverse the
page sequence and also change page rotation. This Wiki article contains a full description.
The GUI script PDFjoiner.py uses this method to join a list of files while also joining the respective table of contents
segments. It looks like this:
There are two methods for adding new pages to a PDF: Document.insert_page() and Document.new_page()
(they share a common code base).
new_page
Document.new_page() returns the created Page object. Here is the constructor showing defaults:
>>> doc = fitz.open(...) # some new or existing PDF document
>>> page = doc.new_page(to = -1, # insertion point: end of document
width = 595, # page dimension: A4 portrait
height = 842)
The above could also have been achieved with the short form page = doc.new_page(). The to parameter specifies the
document’s page number (0-based) in front of which to insert.
To create a page in landscape format, just exchange the width and height values.
Use this to create the page with another pre-defined paper format:
The convenience function paper_size() knows over 40 industry standard paper formats to choose from. To see
them, inspect dictionary paperSizes. Pass the desired dictionary key to paper_size() to retrieve the paper
dimensions. Upper and lower case is supported. If you append “-L” to the format name, the landscape version is
returned.
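The default dimensions 595 x 842 follow directly from the A4 size of 210 mm x 297 mm, because page dimensions are given in points (72 points per inch). The sketch below shows the arithmetic; the helper mm_to_points is hypothetical and not a PyMuPDF function.

```python
def mm_to_points(mm):
    """Convert millimeters to PDF points (1 inch = 25.4 mm = 72 points)."""
    return mm / 25.4 * 72


# ISO A4 is 210 mm x 297 mm
width, height = round(mm_to_points(210)), round(mm_to_points(297))
portrait = (width, height)
landscape = (height, width)  # landscape: exchange width and height
print(portrait, landscape)
```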
Note: Here is a 3-liner that creates a PDF with one empty page. Its file size is 470 bytes:
insert_page
Document.insert_page() also inserts a new page and accepts the same parameters to, width and height. But it
lets you also insert arbitrary text into the new page and returns the number of inserted lines:
The text parameter can be a (sequence of) string (assuming UTF-8 encoding). Insertion will start at Point (50, 72),
which is one inch below top of page and 50 points from the left. The number of inserted text lines is returned. See the
method definition for more details.
This shows a potential use of PyMuPDF with another Python PDF library (the excellent pure Python package pdfrw
is used here as an example).
If a clean, non-corrupt / decompressed PDF is needed, one could dynamically invoke PyMuPDF to recover from many
problems like so:
import sys
from io import BytesIO
from pdfrw import PdfReader
import fitz
#---------------------------------------
# 'Tolerant' PDF reader
#---------------------------------------
def reader(fname, password = None):
    idata = open(fname, "rb").read() # read the PDF into memory and
With the command line utility pdftk (available for Windows only, but reported to also run under Wine) a similar result
can be achieved, see here. However, you must invoke it as a separate process via subprocess.Popen, using stdin and
stdout as communication vehicles.
This deals with splitting up pages of a PDF into arbitrary pieces. For example, you may have a PDF with Letter format
pages which you want to print with a magnification factor of four: each page is split up into 4 pieces, each of which
goes to a separate PDF page in Letter format again:
"""
Create a PDF copy with split-up pages (posterize)
---------------------------------------------------
License: GNU AFFERO GPL V3
(c) 2018 Jorj X. McKie
Usage
------
python posterize.py input.pdf
Result
-------
A file "poster-input.pdf" with 4 output pages for every input page.
Notes
-----
(1) Output file is chosen to have page dimensions of 1/4 of input.
(2) Easily adapt the example to make n pages per input, or decide per each
input page or whatever.
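The geometry behind the split is simple rectangle arithmetic. The following sketch computes the four source areas of a Letter page (612 x 792 points) with plain tuples. The helper quarters is hypothetical, and "top" refers to a coordinate system where y grows downwards, as in PyMuPDF.

```python
def quarters(rect):
    """Split rectangle (x0, y0, x1, y1) into four equally sized parts."""
    x0, y0, x1, y1 = rect
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2  # center of the rectangle
    return [
        (x0, y0, xm, ym),  # top left
        (xm, y0, x1, ym),  # top right
        (x0, ym, xm, y1),  # bottom left
        (xm, ym, x1, y1),  # bottom right
    ]


letter = (0, 0, 612, 792)  # Letter page in points
parts = quarters(letter)
for part in parts:
    print(part)
```

Each of the four rectangles would then be used as the clip area that is shown on its own output page.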
This deals with joining PDF pages to form a new PDF with pages each combining two or four original ones (also
called “2-up”, “4-up”, etc.). This could be used to create booklets or thumbnail-like overviews:
'''
Copy an input PDF to output combining every 4 pages
---------------------------------------------------
License: GNU AFFERO GPL V3
(c) 2018 Jorj X. McKie
Usage
------
python 4up.py input.pdf
Result
-------
A file "4up-input.pdf" with 1 output page for every 4 input pages.
Notes
-----
(1) Output file is chosen to have A4 portrait pages. Input pages are scaled
maintaining side proportions. Both can be changed, e.g. based on input
page size. However, note that not all pages need to have the same size, etc.
(2) Easily adapt the example to combine just 2 pages (like for a booklet) or
make the output page dimension dependent on input, or whatever.
Dependencies
-------------
PyMuPDF 1.12.1 or later
'''
import fitz, sys
infile = sys.argv[1]
src = fitz.open(infile)
doc = fitz.open() # empty output PDF
# by all means, save new file using garbage collection and compression
doc.save("4up-" + infile, garbage=3, deflate=True)
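Two pieces of arithmetic are involved in such an n-up layout: deciding which output page and which slot an input page goes to, and scaling the input page into its slot while keeping its proportions. The sketch below shows both with plain tuples. The helper fit_into is hypothetical, and centering the scaled page within the slot is a choice made for this sketch.

```python
def fit_into(src_w, src_h, target):
    """Scale a page of size src_w x src_h into target, keeping proportions.

    Returns the centered rectangle (x0, y0, x1, y1) actually covered.
    """
    x0, y0, x1, y1 = target
    scale = min((x1 - x0) / src_w, (y1 - y0) / src_h)
    w, h = src_w * scale, src_h * scale
    dx, dy = (x1 - x0 - w) / 2, (y1 - y0 - h) / 2  # center in the target
    return (x0 + dx, y0 + dy, x0 + dx + w, y0 + dy + h)


# the four quadrants of an A4 portrait output page (595 x 842 points)
quadrants = [
    (0, 0, 297.5, 421),
    (297.5, 0, 595, 421),
    (0, 421, 297.5, 842),
    (297.5, 421, 595, 842),
]

for pno in range(6):  # six hypothetical input pages
    out_page, slot = divmod(pno, 4)
    print(pno, "-> output page", out_page, "slot", slot, quadrants[slot])

print(fit_into(612, 792, quadrants[0]))  # a Letter page in the first quadrant
```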
Example effect:
Here is a script that converts any PyMuPDF supported document to a PDF. These include XPS, EPUB, FB2, CBZ and
all image formats, including multi-page TIFF images.
It features maintaining any metadata, table of contents and links contained in the source document:
"""
Demo script: Convert input file to a PDF
-----------------------------------------
Intended for multi-page input files like XPS, EPUB etc.
Features:
---------
Recovery of table of contents and links of input file.
While this works well for bookmarks (outlines, table of contents),
links will only work if they are not of type "LINK_NAMED".
This link type is skipped by the script.
For XPS and EPUB input, internal links however **are** of type "LINK_NAMED".
Base library MuPDF does not resolve them to page numbers.
So, for anyone expert enough to know the internal structure of these
document types, can further interpret and resolve these link types.
Dependencies
--------------
PyMuPDF v1.14.0+
"""
import sys
import fitz
if not (list(map(int, fitz.VersionBind.split("."))) >= [1,14,0]):
raise SystemExit("need PyMuPDF v1.14.0+")
fn = sys.argv[1]
doc = fitz.open(fn)
if not meta["creator"]:
    meta["creator"] = "PyMuPDF PDF converter"
meta["modDate"] = fitz.get_pdf_now()
meta["creationDate"] = meta["modDate"]
pdf.set_metadata(meta)
Since PyMuPDF v1.16.0, error messages issued by the underlying MuPDF library are redirected to the Python
standard device sys.stderr. So you can handle them like any other output going to this device.
In addition, these messages go to the internal buffer together with any MuPDF warnings – see below.
We always prefix these messages with an identifying string “mupdf:”. If you prefer to not see recoverable MuPDF
errors at all, issue the command fitz.TOOLS.mupdf_display_errors(False).
MuPDF warnings continue to be stored in an internal buffer and can be viewed using Tools.mupdf_warnings().
Please note that MuPDF errors may or may not lead to Python exceptions. In other words, you may see error messages
from which MuPDF can recover and continue processing.
Example output for a recoverable error. We are opening a damaged PDF, but MuPDF is able to repair it and gives
us some information on what happened. Then we illustrate how to find out whether the document can later be saved
incrementally. Checking the Document.is_dirty attribute at this point also indicates that the open had to repair
the document:
Starting with version 1.16.0, PDF decryption and encryption (using passwords) are fully supported. You can do the
following:
• Check whether a document is password protected / (still) encrypted (Document.needs_pass, Document.
is_encrypted).
• Gain access authorization to a document (Document.authenticate()).
• Set encryption details for PDF files using Document.save() or Document.write() and
– decrypt or encrypt the content
– set password(s)
– set the encryption method
– set permission details
• The owner password provides full access rights, including changing passwords, encryption method, or
permission details.
• The user password provides access to document content according to the established permission details. If
present, opening the PDF in a viewer will require providing it.
Method Document.authenticate() will automatically establish access rights according to the password used.
The following snippet creates a new PDF and encrypts it with separate user and owner passwords. Permissions are
granted to print, copy and annotate, but no changes are allowed to someone authenticating with the user password:
import fitz
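The permission details are a bit field in which each granted right corresponds to one bit. The sketch below mirrors four of the bit values defined by the PDF specification under local names; PyMuPDF exposes constants with the same meaning (fitz.PDF_PERM_PRINT and so on). The local names and the helper allowed are choices of this sketch.

```python
# permission bits as defined by the PDF specification
PERM_PRINT = 1 << 2     # 4
PERM_MODIFY = 1 << 3    # 8
PERM_COPY = 1 << 4      # 16
PERM_ANNOTATE = 1 << 5  # 32

# grant printing, copying and annotating, but not modifying
perm = PERM_PRINT | PERM_COPY | PERM_ANNOTATE


def allowed(perm, flag):
    """Check whether a permission bit is set."""
    return bool(perm & flag)


print(perm, allowed(perm, PERM_PRINT), allowed(perm, PERM_MODIFY))
```

A value combined like this is what the permissions option expects when the document is saved.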
Opening this document with some viewer (Nitro Reader 5) reflects these settings:
Decrypting will automatically happen on save as before when no encryption parameters are provided.
To keep the encryption method of a PDF save it using encryption=fitz.PDF_ENCRYPT_KEEP. If
doc.can_save_incrementally() == True, an incremental save is also possible.
To change the encryption method specify the full range of options above (encryption, owner_pw, user_pw,
permissions). An incremental save is not possible in this case.
4.8.1.1 Problem
4.8.1.2 Cause
Annotation maintenance is handled differently by each PDF maintenance application. Some annotation types may not
be supported, or not be supported fully or some details may be handled in a different way than in another application.
There is no standard.
Almost always a PDF application also comes with its own icons (file attachments, sticky notes and stamps) and its
own set of supported text fonts. For example:
• (Py-) MuPDF only supports these 5 basic fonts for ‘FreeText’ annotations: Helvetica, Times-Roman, Courier,
ZapfDingbats and Symbol – no italics / no bold variations. When changing a ‘FreeText’ annotation created by
some other app, its font will probably not be recognized nor accepted and be replaced by Helvetica.
• PyMuPDF supports all PDF text markers (highlight, underline, strikeout, squiggly), but these types cannot be
updated with Adobe Acrobat Reader.
In most cases there also exists limited support for line dashing which causes existing dashes to be replaced by straight
lines. For example:
• PyMuPDF fully supports all line dashing forms, while other viewers only accept a limited subset.
4.8.1.3 Solutions
4.8.2.1 Problem
You inserted an item (like an image, an annotation or some text) on an existing PDF page, but later you find it being
placed at a different location than intended. For example an image should be inserted at the top, but it unexpectedly
appears near the bottom of the page.
4.8.2.2 Cause
The creator of the PDF has established a non-standard page geometry without keeping it “local” (as they should!).
Most commonly, the PDF standard point (0,0) at bottom-left has been changed to the top-left point. So top and bottom
are reversed – causing your insertion to be misplaced.
The visible image of a PDF page is controlled by commands coded in a special mini-language. For an overview of
this language consult “Operator Summary” on p. 643 of the Adobe PDF References. These commands are stored in
contents objects as strings (bytes in PyMuPDF).
Some commands in that language change the coordinate system of the page for all the commands that follow.
To keep the scope of such commands local, they must be wrapped by the command pair q (“save graphics
state”, or “stack”) and Q (“restore graphics state”, or “unstack”).
So the PDF creator did this:
stream
1 0 0 -1 0 792 cm % <=== change of coordinate system:
... % letter page, top / bottom reversed
... % remains active beyond these lines
endstream
where they should have done this:
stream
q % put the following in a stack
1 0 0 -1 0 792 cm % <=== scope of this is limited by Q command
... % here, a different geometry exists
Q % after this line, geometry of outer scope prevails
endstream
Note:
• In the mini-language’s syntax, spaces and line breaks are equally accepted token delimiters.
• Multiple consecutive delimiters are treated as one.
• Keywords “stream” and “endstream” are inserted automatically – not by the programmer.
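To see why an unwrapped cm command misplaces later insertions, it helps to apply the matrix by hand. The sketch below uses a hypothetical helper apply_cm() and the standard PDF rule that the six operands a b c d e f map a point (x, y) to (a*x + c*y + e, b*x + d*y + f):

```python
def apply_cm(matrix, point):
    """Apply the six operands of a PDF 'cm' command to a point (x, y)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


flip = (1, 0, 0, -1, 0, 792)  # operands of "1 0 0 -1 0 792 cm"

# A point one inch from the left and bottom edges of a Letter page ...
print(apply_cm(flip, (72, 72)))  # ... lands one inch below the top edge
```

The x coordinate is unchanged while y becomes 792 - y, so anything positioned relative to the bottom ends up measured from the top as long as the matrix stays active.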
4.8.2.3 Solutions
Since v1.16.0, there is the property Page.is_wrapped, which lets you check whether a page’s contents are
wrapped in that string pair.
If it is False or if you want to be on the safe side, pick one of the following:
1. The easiest way: in your script, do a Page.clean_contents() before you do your first item insertion.
2. Pre-process your PDF with the MuPDF command line utility mutool clean -c . . . and work with its output file
instead.
3. Directly wrap the page’s contents with the stacking commands before you do your first item insertion.
Solutions 1. and 2. use the same technical basis and do a lot more than what is required in this context: they also
clean up other inconsistencies or redundancies that may exist, multiple /Contents objects will be concatenated into
one, and much more.
Note: For incremental saves, solution 1. has an unpleasant implication: it will bloat the update delta, because
it changes so many things and, in addition, stores the cleaned contents uncompressed. So, if you use Page.
clean_contents() you should consider saving to a new file with (at least) garbage=3 and deflate=True.
Solution 3. is completely under your control and only does the minimum corrective action. There exists a handy
low-level utility function which you can use for this. Suggested procedure:
• Prepend the missing stacking command by executing fitz.TOOLS._insert_contents(page, b"q\n", False).
• Append an unstacking command by executing fitz.TOOLS._insert_contents(page, b"\nQ", True).
• Alternatively, just use Page._wrap_contents(), which executes the previous two functions.
Note: If small incremental update deltas are a concern, this approach is the most effective. Other contents objects are
not touched. The utility method creates two new PDF stream objects and inserts them before and after the page’s
other contents, respectively. We therefore recommend the following snippet to get this situation under control:
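The recommended pattern is a guarded call: wrap the contents only when Page.is_wrapped is False. As an illustration of what that amounts to at the byte level (this is not PyMuPDF's implementation), the sketch below applies the same idea to a plain bytes object. The helper names is_wrapped() and wrap() are hypothetical, and the check is deliberately rough:

```python
def is_wrapped(contents: bytes) -> bool:
    """Rough check: does the stream start with 'q' and end with 'Q'?"""
    tokens = contents.split()  # spaces and line breaks both delimit tokens
    return len(tokens) >= 2 and tokens[0] == b"q" and tokens[-1] == b"Q"


def wrap(contents: bytes) -> bytes:
    """Wrap a contents stream in q ... Q unless it already looks wrapped."""
    if is_wrapped(contents):
        return contents
    return b"q\n" + contents + b"\nQ"


raw = b"1 0 0 -1 0 792 cm\nBT /F1 12 Tf (Hello) Tj ET"
print(wrap(raw).decode())
```

In a real script the two stacking commands go into separate stream objects before and after the existing contents, as described above, so the existing streams are never rewritten.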
Fairly often, text extraction does not work as you would expect: text may be missing altogether, may not appear in
the reading sequence visible on your screen, or may contain garbled characters (like a ? or a “TOFU” symbol), etc. This
can be caused by a number of different problems.
Your PDF viewer does display text, but you cannot select it with your cursor, and text extraction delivers nothing.
4.8.3.2 Cause
1. You may be looking at an image embedded in the PDF page (e.g. a scanned PDF).
2. The PDF creator used no font, but simulated text by painting it, using little lines and curves. E.g. a capital “D”
could be painted by a line “|” and a left-open semi-circle, an “o” by an ellipse, and so on.
4.8.3.3 Solution
Use OCR software such as OCRmyPDF to insert a hidden text layer underneath the visible page. The resulting PDF
should behave as expected.
Text extraction does not deliver the text in readable order, duplicates some text, or is otherwise garbled.
4.8.3.5 Cause
1. The single characters are readable as such (no “<?>” symbols), but the sequence in which the text is coded in
the file deviates from the reading order. The motivation behind this may be technical, or it may be the protection
of data against unwanted copies.
2. Many “<?>” symbols occur, indicating MuPDF could not interpret these characters. The font may indeed be
unsupported by MuPDF, or the PDF creator may have used a font that displays readable text, but on purpose
obfuscates the corresponding unicode character.
4.8.3.6 Solution
Numerous methods are available to access and manipulate PDF files on a fairly low level. Admittedly, a clear distinction
between “low level” and “normal” functionality is not always possible and is partly a matter of personal taste.
It may also happen that functionality previously deemed low-level is later assessed as being part of the normal
interface. This happened in v1.14.0 for the class Tools – you now find it as an item in the Classes chapter.
In any case, this is a matter of documentation only: it determines in which chapter you find what. Everything
is always available, and always via the same interface.
A PDF’s xref table is a list of all objects defined in the file. This table may easily contain many thousands of entries
– the manual Adobe PDF References, for example, has 127,000 objects. Table entry “0” is reserved and must not be
touched. The following script loops through the xref table and prints each object’s definition:
Some object types contain additional data apart from their object definition. Examples are images, fonts, embedded
files or commands describing the appearance of a page.
Objects of these types are called “stream objects”. PyMuPDF allows reading an object’s stream via method
Document.xref_stream() with the object’s xref as an argument. It is also possible to write back a modi-
fied version of a stream using Document.update_stream().
Assume that the following snippet wants to read all streams of a PDF for whatever reason:
A PDF page can have zero or multiple contents objects. These are stream objects describing what appears where
and how on a page (like text and images). They are written in a special mini-language described e.g. in chapter
“APPENDIX A - Operator Summary” on page 643 of the Adobe PDF References.
Every PDF reader application must be able to interpret the contents syntax to reproduce the intended appearance of
the page.
If multiple contents objects are provided, they must be interpreted in the specified sequence in exactly the same
way as if they were provided as a concatenation of the several.
There are good technical arguments for having multiple contents objects:
• It is a lot easier and faster to just add new contents objects than maintaining a single big one (which entails
reading, decompressing, modifying, recompressing, and rewriting it for each change).
• When working with incremental updates, a modified big contents object will bloat the update delta and can
thus easily negate the efficiency of incremental saves.
For example, PyMuPDF adds new, small contents objects in methods Page.insert_image(), Page.
show_pdf_page() and the Shape methods.
However, there are also situations when a single contents object is beneficial: it is easier to interpret and better
compressible than multiple smaller ones.
Here are two ways of combining multiple contents of a page:
The clean function Page.clean_contents() does a lot more than just glueing contents objects: it also
corrects and optimizes the PDF operator syntax of the page and removes any inconsistencies with the page’s object
definition.
This is a central (“root”) object of a PDF. It serves as a starting point to reach important other objects and it also
contains some global options for the PDF:
Note: Indentation, line breaks and comments are inserted here for clarification purposes only and will not normally
appear. For more information on the PDF catalog see section 7.7.2 on page 71 of the Adobe PDF References.
The trailer of a PDF file is a dictionary located towards the end of the file. It contains special objects, and pointers
to important other information. See Adobe PDF References p. 42. Here is an overview:
Access this information via PyMuPDF with Document.pdf_trailer() or, equivalently, via Document.
xref_object() using -1 instead of a valid xref number.
A PDF may contain XML metadata in addition to the standard metadata format. In fact, most PDF viewer or modifi-
cation software adds this type of information when saving the PDF (Adobe, Nitro PDF, PDF-XChange, etc.).
PyMuPDF has no way to interpret or change this information directly, because it contains no XML features. XML
metadata is however stored as a stream object, so it can be read, modified with appropriate software and written
back.
Using some XML package, the XML data can be interpreted and / or modified and then stored back. The following
also works, if the PDF previously had no XML metadata:
Attribute Document.metadata is designed so it works for all supported document types in the same way: it is
a Python dictionary with a fixed set of key-value pairs. Correspondingly, Document.set_metadata() only
accepts standard keys.
However, PDFs may contain items not accessible like this. Also, there may be reasons to store additional information,
like copyrights. Here is a way to handle arbitrary metadata items by using PyMuPDF low-level functions.
As an example, look at this standard metadata output of some PDF:
# ---------------------
# standard metadata
# ---------------------
pprint(doc.metadata)
{'author': 'PRINCE',
'creationDate': "D:2010102417034406'-30'",
Use the following code to see all items stored in the metadata object:
# ----------------------------------
# metadata including private items
# ----------------------------------
metadata = {} # make my own metadata dict
what, value = doc.xref_get_key(-1, "Info") # /Info key in the trailer
if what != "xref":
    pass # PDF has no metadata
else:
    xref = int(value.replace("0 R", "")) # extract the metadata xref
    for key in doc.xref_get_keys(xref):
        metadata[key] = doc.xref_get_key(xref, key)[1]
pprint(metadata)
{'Author': 'PRINCE',
'CreationDate': "D:2010102417034406'-30'",
'Creator': 'PrimoPDF http://www.primopdf.com/',
'ModDate': "D:20200725062431-04'00'",
'PXCViewerInfo': 'PDF-XChange Viewer;2.5.312.1;Feb 9 '
"2015;12:00:06;D:20200725062431-04'00'",
'Producer': 'macOS Version 10.15.6 (Build 19G71a) Quartz PDFContext, '
'AppendMode 1.1',
'Title': 'Full page fax print'}
# ---------------------------------------------------------------
# note the additional 'PXCViewerInfo' key - ignored in standard!
# ---------------------------------------------------------------
Vice versa, you can also store private metadata items in a PDF. It is your responsibility to make sure that these items
conform to PDF specifications – in particular, they must be (unicode) strings. Consult section 14.3 (p. 548) of the
Adobe PDF References for details and caveats:
To delete selected keys, use doc.xref_set_key(xref, "mykey", "null"). As explained in the next section,
the string “null” is the PDF equivalent of Python’s None. A key with that value will be treated as if it were not specified
– and physically removed in garbage collections.
There also exist granular, elegant ways to access and manipulate selected PDF dictionary keys.
• Document.xref_get_keys() returns the PDF keys of the object at xref:
In [6]: print(doc.xref_object(page.xref))
<<
/Type /Page
/Contents 1297 0 R
/Resources 1296 0 R
/MediaBox [ 0 0 612 792 ]
/Parent 1301 0 R
>>
• Single keys can also be accessed directly via Document.xref_get_key(). The value is always a string,
accompanied by type information that helps to interpret it:
• An undefined key inquiry returns ('null', 'null') – PDF object type null corresponds to None in
Python. Similar for the booleans true and false.
• Let us add a new key to the page definition that sets its rotation to 90 degrees (you are aware that there actually
exists Page.set_rotation() for this?):
• This method can also be used to remove a key from the xref dictionary by setting its value to null:
The following will remove the rotation specification from the page: doc.xref_set_key(page.xref,
"Rotate", "null"). Similarly, to remove all links, annotations and fields from a page, use
doc.xref_set_key(page.xref, "Annots", "null"). Because Annots by definition is an array, setting
an empty array with the statement doc.xref_set_key(page.xref, "Annots", "[]") would
do the same job in this case.
• PDF dictionaries can be hierarchically nested. In the following page object definition, both Font and XObject
are subdictionaries of Resources:
In [15]: print(doc.xref_object(page.xref))
<<
/Type /Page
/Contents 1297 0 R
/Resources <<
/XObject <<
/Im1 1291 0 R
>>
/Font <<
/F39 1299 0 R
/F40 1300 0 R
>>
>>
/MediaBox [ 0 0 612 792 ]
/Parent 1301 0 R
/Rotate 90
>>
• The path notation can also be used to directly set a value: use the following to let Im1 point to a different
object:
Be aware, that no semantic checks whatsoever will take place here: if the PDF has no xref 9999, it won’t be
detected at this point.
• If a key does not exist, it will be created by setting its value. Moreover, if any intermediate keys do not exist
either, they will also be created as necessary. The following creates an array D several levels below the existing
dictionary A. Intermediate dictionaries B and C are automatically created:
• When setting key values, basic PDF syntax checking will be done by MuPDF. For example, new keys can
only be created below a dictionary. The following tries to create some new string item E below the previously
created array D:
• It is also not possible to create a key if some higher-level key is an “indirect” object, i.e. an xref. In other
words, xrefs can only be modified directly and not implicitly via other objects referencing them:
Caution: These are expert functions! There are no validations as to whether valid PDF objects, xrefs, etc. are
specified. As with other low-level methods, there is a risk of rendering the PDF, or parts of it, unusable.
4.10 Journalling
Starting with version 1.19.0, journalling is possible when updating PDF documents.
Journalling is a logging mechanism which permits either reverting or re-applying changes to a PDF. Similar to LUWs
“Logical Units of Work” in modern database systems, one can group a set of updates into an “operation”. In MuPDF
journalling, an operation plays the role of a LUW.
Note: In contrast to LUW implementations found in database systems, MuPDF journalling happens on a per doc-
ument level. There is no support for simultaneous updates across multiple PDFs: one would have to establish one’s
own logic here.
• Journalling must be enabled via a document method. Journalling is possible for existing or new documents.
Journalling can be disabled only by closing the file.
• Once enabled, every change must happen inside an operation – otherwise an exception is raised. An operation
is started and stopped via document methods. Updates happening between these two calls form an LUW and
can thus collectively be rolled back or re-applied or, in MuPDF terminology, “undone” or “redone”.
• At any point, the journalling status can be queried: whether journalling is active, how many operations have
been recorded, whether “undo” or “redo” is possible, the current position inside the journal, etc.
• The journal can be saved to or loaded from a file. These are document methods.
• When loading a journal file, compatibility with the document is checked and journalling is automatically enabled
upon success.
• For an existing PDF being journalled, a special new save method is available: Document.save_snapshot().
This performs a special incremental save that includes all journalled updates so far.
If its journal is saved at the same time (immediately after the document snapshot), then document and journal
are in sync and can later be used together to undo or redo operations or to continue journalled updates – just
as if there had been no interruption.
• The snapshot PDF is a valid PDF in every aspect and fully usable. If the document is however changed in any
way without using its journal file, then a desynchronization will take place and the journal is rendered unusable.
• Snapshot files are structured like incremental updates. Nevertheless, the internal journalling logic requires, that
saving must happen to a new file. So the user should develop a file naming convention to support recognizable
relationships between an original PDF, like original.pdf and its snapshot sets, like original-snap1.
pdf / original-snap1.log, original-snap2.pdf / original-snap2.log, etc.
Description:
• Make a new PDF and enable journalling. Then add a page and some text lines – each as a separate operation.
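• Navigate within the journal, undoing and redoing these updates and displaying status and file results:
>>> import fitz
>>> doc = fitz.open()  # new, empty PDF
>>> doc.journal_enable()  # from now on, every update must happen inside an operation
>>> doc.journal_start_op("op1")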
>>> page = doc.new_page()
>>> doc.journal_stop_op()
>>> doc.journal_start_op("op2")
>>> page.insert_text((100,100), "Line 1")
>>> doc.journal_stop_op()
>>> doc.journal_start_op("op3")
>>> page.insert_text((100,120), "Line 2")
>>> doc.journal_stop_op()
>>> doc.journal_start_op("op4")
>>> page.insert_text((100,140), "Line 3")
>>> doc.journal_stop_op()
Description:
• Similar to previous, but after undoing some operations, we now add a different update. This will cause:
– permanent removal of the undone journal entries
– the new update operation will become the new last entry.
>>> import fitz
>>> doc = fitz.open()
>>> doc.journal_enable()
>>> doc.journal_start_op("Page insert")
>>> page = doc.new_page()
>>> doc.journal_stop_op()
>>> for i in range(5):
        doc.journal_start_op("insert-%i" % i)
        page.insert_text((100, 100 + 20*i), "text line %i" % i)
        doc.journal_stop_op()
Module fitz
5.1 Invocation
General remarks:
• Request help via "-h", or command-specific help via "command -h".
• Parameters may be abbreviated where this does not introduce ambiguities.
• Several commands support parameters -pages and -xrefs. They are intended for down-selection. Please
note that:
– page numbers for this utility must be given 1-based.
– valid xref numbers start at 1.
– Specify a comma-separated list of either single integers or integer ranges. A range is a pair of integers
separated by one hyphen “-”. Integers must not exceed the maximum page or xref number, respectively. To specify
that maximum, the symbolic variable “N” may be used. Integers or ranges may occur several times, in
any sequence, and may overlap. If the first number of a range is greater than the second one, the respective
items will be processed in reversed order.
• How to use the module inside your script:
• Use the following 2-liner and compile it with Nuitka in standalone mode. This will give you a CLI executable
with all the module’s features, that can be used on all compatible platforms without Python, PyMuPDF or
MuPDF being installed.
This command will optimize the PDF and store the result in a new file. You can use it also for encryption, decryption
and creating sub documents. It is mostly similar to the MuPDF command line utility “mutool clean”:
positional arguments:
input PDF filename
output output PDF filename
optional arguments:
-h, --help show this help message and exit
-password PASSWORD password
-encryption {keep,none,rc4-40,rc4-128,aes-128,aes-256}
encryption method
-owner OWNER owner password
-user USER user password
-garbage {0,1,2,3,4} garbage collection level
-compress compress (deflate) output
-ascii ASCII encode binary data
-linear format for fast web display
-permission PERMISSION
integer with permission levels
-sanitize sanitize / clean contents
-pretty prettify PDF structure
-pages PAGES output selected pages, format: 1,5-7,50-N
If you specify “-pages”, be aware that only page-related objects are copied, no document-level items like e.g. em-
bedded files.
Please consult Document.save() for the parameter meanings.
positional arguments:
input PDF filename
optional arguments:
-h, --help show this help message and exit
-images extract images
-fonts extract fonts
-output OUTPUT output directory, defaults to current
-password PASSWORD password
-pages PAGES only consider these pages, format: 1,5-7,50-N
Image filenames are built according to the naming scheme: “img-xref.ext”, where “ext” is the extension associated
with the image and “xref” the xref of the image PDF object.
Font filenames consist of the fontname and the associated extension. Any spaces in the fontname are replaced with
hyphens “-”.
The output directory must already exist.
Note: Except for output directory creation, this feature is functionally equivalent to and obsoletes this script.
positional arguments:
input input filenames
optional arguments:
-h, --help show this help message and exit
-output OUTPUT output filename
Note:
1. Each input must be entered as “filename,password,pages”. Password and pages are optional.
2. The password entry is required if the “pages” entry is used. If the PDF needs no password, specify two commas.
3. The “pages” format is the same as explained at the top of this section.
4. Each input file is immediately closed after use. Therefore you can use one of them as output filename, and thus
overwrite it.
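The “pages” format used here and by the -pages and -xrefs parameters (for example 1,5-7,50-N) can be illustrated with a small parser. This is a sketch written for this explanation, not the module's own code; the function name parse_ranges is hypothetical:

```python
def parse_ranges(spec: str, maximum: int) -> list:
    """Expand a spec like "1,5-7,50-N" into a list of 1-based integers.

    "N" stands for `maximum`; a descending range yields reversed order.
    """
    result = []
    for item in spec.split(","):
        parts = [maximum if p.strip() == "N" else int(p) for p in item.split("-")]
        if len(parts) == 1:
            first = last = parts[0]
        elif len(parts) == 2:
            first, last = parts
        else:
            raise ValueError("bad range: %r" % item)
        if not (1 <= first <= maximum and 1 <= last <= maximum):
            raise ValueError("out of range: %r" % item)
        step = 1 if first <= last else -1
        result.extend(range(first, last + step, step))
    return result


print(parse_ranges("1,5-7,9-N", 10))
print(parse_ranges("N-8,2", 10))
```

Items may repeat and overlap, so the result is kept as a list in the order given rather than being turned into a set.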
Display PDF internal information. Again, there are similarities to “mutool show”:
positional arguments:
input PDF filename
optional arguments:
-h, --help show this help message and exit
-password PASSWORD password
-catalog show PDF catalog
-trailer show PDF trailer
-metadata show PDF metadata
-xrefs XREFS show selected objects, format: 1,5-7,N
-pages PAGES show selected pages, format: 1,5-7,50-N
Examples:
The following commands deal with embedded files – which is a feature completely removed from MuPDF after v1.14,
and hence from all its command line tools.
5.6.1 Information
positional arguments:
input PDF filename
optional arguments:
Example:
20110813_180956_0002.jpg
20110813_181009_0003.jpg
20110813_181012_0004.jpg
20110813_181131_0005.jpg
20110813_181144_0006.jpg
20110813_181306_0007.jpg
20110813_181307_0008.jpg
20110813_181314_0009.jpg
20110813_181315_0010.jpg
20110813_181324_0011.jpg
20110813_181339_0012.jpg
20110813_181913_0013.jpg
insta-20110813_180944_0001.jpg
markiert-20110813_180944_0001.jpg
neue.datei
name: neue.datei
filename: text-tester.pdf
ufilename: text-tester.pdf
desc: nur zum Testen!
size: 4639
length: 1566
5.6.2 Extraction
positional arguments:
input PDF filename
optional arguments:
-h, --help show this help message and exit
-name NAME name of entry
-password PASSWORD password
-output OUTPUT output filename, default is stored name
5.6.3 Deletion
positional arguments:
input PDF filename
optional arguments:
-h, --help show this help message and exit
-password PASSWORD password
-output OUTPUT output PDF filename, incremental save if none
-name NAME name of entry to delete
5.6.4 Insertion
positional arguments:
input PDF filename
optional arguments:
-h, --help show this help message and exit
-password PASSWORD password
-output OUTPUT output PDF filename, incremental save if none
-name NAME name of new entry
-path PATH path to data for new entry
-desc DESC description of new entry
“NAME” must not already exist in the PDF. For details consult Document.embfile_add().
5.6.5 Updates
positional arguments:
input PDF filename
optional arguments:
-h, --help show this help message and exit
-name NAME name of entry
-password PASSWORD password
-output OUTPUT Output PDF filename, incremental save if none
-path PATH path to new data for entry
-filename FILENAME new filename to store in entry
-ufilename UFILENAME new unicode filename to store in entry
-desc DESC new description to store in entry
Use this method to change meta-information of the file – just omit the “PATH”. For details consult Document.
embfile_upd().
5.6.6 Copying
positional arguments:
input PDF to receive embedded files
optional arguments:
-h, --help show this help message and exit
-password PASSWORD password of input
-output OUTPUT output PDF, incremental save to 'input' if omitted
-source SOURCE copy embedded files from here
-pwdsource PWDSOURCE password of 'source' PDF
-name [NAME [NAME ...]]
restrict copy to these entries
(New in v1.18.16)
Extract text from arbitrary supported documents (not only PDF) to a textfile. Currently, there are three output format-
ting modes available: simple, block sorting and reproduction of physical layout.
• Simple text extraction reproduces all text as it appears in the document pages – no effort is made to rearrange
in any particular reading order.
• Block sorting sorts text blocks (as identified by MuPDF) by ascending vertical, then horizontal coordinates.
This should be sufficient to establish a “natural” reading order for basic pages of text.
• Layout strives to reproduce the original appearance of the input pages. You can expect results like this (produced
by the command python -m fitz gettext -pages 1 demo1.pdf):
Note: The “gettext” command offers a functionality similar to the CLI tool pdftotext by XPDF software, http://
www.foolabs.com/xpdf/ – this is especially true for “layout” mode, which combines that tool’s -layout and -table
options.
After each page of the output file, a formfeed character (ASCII code 12, 0x0C) is written – even if the input page has no text at all.
This behavior can be controlled via options.
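With the default settings, the formfeeds make it easy to split the output file back into pages. A minimal sketch, where the helper name split_pages and the sample string are made up for the example:

```python
def split_pages(text: str) -> list:
    """Split gettext output into per-page strings at formfeed characters."""
    pages = text.split("\f")       # "\f" is chr(12)
    if pages and pages[-1] == "":  # a formfeed follows the last page, too
        pages.pop()
    return pages


sample = "page one\n\f\fpage three\n\f"  # page two had no text
print(split_pages(sample))
```

This only applies to the default output: with -noformfeed there is no formfeed to split on, and with -skip-empty the list index no longer matches the page number.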
Note: For “layout” mode, only horizontal, left-to-right, top-to-bottom text is supported; other text is ignored. In
this mode, text is also ignored if its fontsize is too small.
“Simple” and “blocks” mode in contrast output all text for any text size or orientation.
Command:
python -m fitz gettext -h
usage: fitz gettext [-h] [-password PASSWORD] [-mode {simple,blocks,layout}] [-pages PAGES] [-noligatures]
positional arguments:
input input document filename
optional arguments:
-h, --help show this help message and exit
-password PASSWORD password for input document
-mode {simple,blocks,layout}
mode: simple, block sort, or layout (default)
-pages PAGES select pages, format: 1,5-7,50-N
-noligatures expand ligature characters (default False)
-convert-white convert whitespace characters to space (default False)
-extra-spaces fill gaps with spaces (default False)
-noformfeed write linefeeds, no formfeeds (default False)
-skip-empty suppress pages with no text (default False)
-output OUTPUT store text in this file (default inputfilename.txt)
-grid GRID merge lines if closer than this (default 2)
-fontsize FONTSIZE only include text with a larger fontsize (default 3)
Note: Command options may be abbreviated as long as no ambiguities are introduced. So the following do the same:
• ... -output text.txt -noligatures -noformfeed -convert-white -grid 3
-extra-spaces ...
• ... -o text.txt -nol -nof -c -g 3 -e ...
The output filename defaults to the input with its extension replaced by .txt. As with other commands, you can
select page ranges (caution: 1-based!) in mutool format, as indicated above.
Classes
6.1 Annot
Note: Unfortunately, there exists no single, unique naming convention in PyMuPDF: examples of CamelCase,
mixedCase and lower_case_with_underscores can be found all over the place. We are now in the process of cleaning
this up, step by step.
This class, Annot, is the first candidate for this exercise. In this chapter, you will for example find
Annot.get_pixmap() – and no longer the old name getPixmap. The method with the old name however continues
to exist and you can continue using it: your existing code will not break. But we do hope you will start using the new
names – for new code at least.
Class API
class Annot
Note: If the annotation has just been created or modified, you should reload the page first via page =
doc.reload_page(page).
Note:
• While ‘FreeText’, ‘Line’, ‘PolyLine’, and ‘Polygon’ annotations can have these properties, (Py-)
MuPDF does not support line ends for ‘FreeText’, because the call-out variant of it is not supported.
• (Changed in v1.16.16) Some symbols have an interior area (diamonds, circles, squares, etc.). By
default, these areas are filled with the fill color of the annotation. If this is None, then white is chosen.
The fill_color argument of Annot.update() can now be used to override this and give line end
symbols their own fill color.
Parameters
• start (int) – The symbol number for the first point.
• end (int) – The symbol number for the last point.
set_oc(xref)
Set the annotation’s visibility using PDF optional content mechanisms. This visibility is controlled by the
user interface of supporting PDF viewers. It is independent from other attributes like Annot.flags.
Parameters xref (int) – the xref of an optional contents group (OCG or OCMD). Any
previous xref will be overwritten. If zero, a previous entry will be removed. An exception
occurs if the xref is not zero and does not point to a valid PDF object.
get_oc()
Return the xref of an optional content object, or zero if there is none.
Returns zero or the xref of an OCG (or OCMD).
set_irt_xref(xref)
• New in v1.19.3
Set annotation to be “In Response To” another one.
Parameters xref (int) – The xref of another annotation.
Note: Must refer to an existing annotation on this page. Setting this property requires no
subsequent update().
set_open(value)
(New in v1.18.4)
Set the annotation’s Popup annotation to open or closed – or the annotation itself, if its type is ‘Text’
(“sticky note”).
Parameters value (bool) – the desired open state.
set_popup(rect)
(New in v1.18.4)
Create a Popup annotation for the annotation and specify its rectangle. If the Popup already exists, only its
rectangle is updated.
Parameters rect (rect_like) – the desired rectangle.
set_opacity(value)
Set the annotation’s transparency. Opacity can also be set in Annot.update().
Parameters value (float) – a float in range [0, 1]. Any value outside is assumed to be 1.
E.g. a value of 0.5 sets the transparency to 50%.
Three overlapping ‘Circle’ annotations with each opacity set to 0.5:
blendmode
(New in v1.18.4)
The annotation’s blend mode. See Adobe PDF References, page 324 for explanations.
Return type str
Returns
the blend mode or None.
>>> annot=page.first_annot
>>> annot.blendmode
'Multiply'
set_blendmode(blendmode)
(New in v1.16.14) Set the annotation’s blend mode. See Adobe PDF References, page 324 for explanations.
The blend mode can also be set in Annot.update().
Parameters blendmode (str) – set the blend mode. Use Annot.update() to reflect
this in the visual appearance. For predefined values see PDF Standard Blend Modes. Use
PDF_BM_Normal to remove a blend mode.
>>> annot.set_blendmode(fitz.PDF_BM_Multiply)
>>> annot.update()
>>> # or in one statement:
>>> annot.update(blend_mode=fitz.PDF_BM_Multiply, ...)
set_name(name)
(New in version 1.16.0) Change the name field of any annotation type. For ‘FileAttachment’ and ‘Text’
annotations, this is the icon name, for ‘Stamp’ annotations the text in the stamp. The visual result (if any)
depends on your PDF viewer. See also Annotation Icons in MuPDF.
Parameters name (str) – the new name.
Caution: If you set the name of a ‘Stamp’ annotation, this will not change the rectangle, nor will
the text be laid out in any way. If you choose a standard text from Stamp Annotation Icons (the exact
name piece after “STAMP_”), you should receive the original layout. An arbitrary text will not be
changed to upper case, but will be written in font “Times-Bold” as is, horizontally centered in one line and
shortened to fit. To get your text fully displayed, its length using fontsize 20 must not exceed 190 pixels.
So please make sure that the following inequality is true: fitz.get_text_length(text,
fontname="tibo", fontsize=20) <= 190.
set_rect(rect)
Change the rectangle of an annotation. The annotation can be moved around and both sides of the rectangle
can be independently scaled. However, the annotation appearance will never get rotated, flipped or sheared.
Parameters rect (rect_like) – the new rectangle of the annotation (finite and not empty).
E.g. using a value of annot.rect + (5, 5, 5, 5) will shift the annot position 5 pixels to the right
and downwards.
Note: You need not invoke Annot.update() for activation of the effect.
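The arithmetic behind annot.rect + (5, 5, 5, 5) can be illustrated without PyMuPDF. The following is a plain-Python sketch that treats a rectangle as a tuple (x0, y0, x1, y1). The helper shift_rect and the sample coordinates are hypothetical, chosen only for illustration, and are not part of the library.

```python
def shift_rect(rect, delta):
    """Add delta component-wise to a rectangle given as (x0, y0, x1, y1)."""
    if len(rect) != 4 or len(delta) != 4:
        raise ValueError("rect and delta must each have 4 values")
    return tuple(r + d for r, d in zip(rect, delta))


old_rect = (100, 100, 200, 150)                # hypothetical annotation rectangle
moved = shift_rect(old_rect, (5, 5, 5, 5))     # all four values grow: pure shift
widened = shift_rect(old_rect, (0, 0, 20, 0))  # only the right edge moves

print(moved)    # (105, 105, 205, 155)
print(widened)  # (100, 100, 220, 150)
```

Adding the same offset to both corners keeps width and height unchanged, which is why the first call moves the rectangle while the second one scales it horizontally.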
set_rotation(angle)
Set the rotation of an annotation. This rotates the annotation rectangle around its center point. Then a new
annotation rectangle is calculated from the resulting quad.
Parameters angle (int) – rotation angle in degrees. Arbitrary values are possible, but will be
clamped to the interval 0 <= angle < 360.
Note:
• You must invoke Annot.update() to activate the effect.
• For PDF_ANNOT_FREE_TEXT, only one of the values 0, 90, 180 and 270 is possible and will rotate
the text inside the current rectangle (which remains unchanged). Other values are silently ignored and
replaced by 0.
• Otherwise, only the following Annotation Types can be rotated: ‘Square’, ‘Circle’, ‘Caret’, ‘Text’,
‘FileAttachment’, ‘Ink’, ‘Line’, ‘Polyline’, ‘Polygon’, and ‘Stamp’. For all others the method is a
no-op.
Color specifications may be made in the usual format used in PyMuPDF as sequences of floats ranging
from 0.0 to 1.0 (including both). The sequence length must be 1, 3 or 4 (supporting GRAY, RGB and
CMYK colorspaces respectively). For mono-color, just a float is also acceptable and yields some shade of
gray.
Parameters
• opacity (float) – (new in v1.16.14) valid for all annotation types: change or set the
annotation’s transparency. Valid values are 0 <= opacity < 1.
• blend_mode (str) – (new in v1.16.14) valid for all annotation types: change or set
the annotation’s blend mode. For valid values see PDF Standard Blend Modes.
• fontsize (float) – change font size of the text. ‘FreeText’ annotations only.
• text_color (sequence,float) – change the text color. ‘FreeText’ annotations
only.
• border_color (sequence,float) – change the border color. ‘FreeText’ annotations
only.
• fill_color (sequence,float) – the fill color.
– ’Line’, ‘Polyline’, ‘Polygon’ annotations: use it to give applicable line end symbols a
fill color other than that of the annotation (changed in v1.16.16).
• cross_out (bool) – (new in v1.17.2) add two diagonal lines to the annotation rectangle.
‘Redact’ annotations only. If not desired, False must be specified even if the annotation
was created with False.
• rotate (int) – new rotation value. Default (-1) means no change. Supports ‘FreeText’
and several other annotation types (see Annot.set_rotation() and footnote 1). Only choose 0,
90, 180, or 270 degrees for ‘FreeText’. Otherwise any integer is acceptable.
Return type bool
file_info()
Basic information of the annot’s attached file.
Return type dict
Returns a dictionary with keys filename, ufilename, desc (description), size (uncompressed file
size), length (compressed length) for FileAttachment annot types, else None.
get_file()
Returns attached file content.
Return type bytes
Returns the content of the attached file.
update_file(buffer=None, filename=None, ufilename=None, desc=None)
Updates the content of an attached file. All arguments are optional. No arguments lead to a no-op.
Parameters
• buffer (bytes|bytearray|BytesIO) – the new file content. Omit to only change
meta-information.
(Changed in version 1.14.13) io.BytesIO is now also supported.
• filename (str) – new filename to associate with the file.
1 Rotating an annotation generally also changes its rectangle. Depending on how the annotation was defined, the original rectangle in general is
not reconstructible by setting the rotation value to zero. This information may be lost.
Key          Description
rate         (float, requ.) samples per second
channels     (int, opt.) number of sound channels
bps          (int, opt.) bits per sample value per channel
encoding     (str, opt.) encoding format: Raw, Signed, muLaw, ALaw
compression  (str, opt.) name of compression filter
stream       (bytes, requ.) the sound file content
opacity
The annotation’s transparency. If set, it is a value in range [0, 1]. The PDF default is 1. However, in an
effort to tell the difference, we return -1.0 if not set.
Return type float
parent
The owning page object of the annotation.
Return type Page
rotation
The annot rotation.
Return type int
Returns a value in range [-1, 359]. If no rotation is set at all, -1 is returned (and implies a rotation angle of
0). Other possible values are normalized to some value 0 <= angle < 360.
rect
The rectangle containing the annotation.
Return type Rect
next
The next annotation on this page or None.
Return type Annot
type
A number and one or two strings describing the annotation type, like [2, ‘FreeText’, ‘FreeTextCallout’].
The second string entry is optional and may be empty. See the appendix Annotation Types for a list of
possible values and their meanings.
Return type list
info
A dictionary containing various information. All fields are optional strings. If a piece of information is not
provided, an empty string is returned.
• name – e.g. for ‘Stamp’ annotations it will contain the stamp text like “Sold” or “Experimental”, for
other annot types you will see the name of the annot’s icon here (“PushPin” for FileAttachment).
• content – a string containing the text for type Text and FreeText annotations. Commonly used for
filling the text field of annotation pop-up windows.
• title – a string containing the title of the annotation pop-up window. By convention, this is used for
the annotation author.
• creationDate – creation timestamp.
• modDate – last modified timestamp.
• subject – subject.
• id – (new in version 1.16.10) a unique identification of the annotation. This is taken from PDF key
/NM. Annotations added by PyMuPDF will have a unique name, which appears here.
flags
An integer whose low order bits contain flags for how the annotation should be presented.
Return type int
line_ends
A pair of integers specifying start and end symbol of annotation types ‘FreeText’, ‘Line’, ‘PolyLine’, and
‘Polygon’. None if not applicable. For possible values and descriptions in this list, see the Adobe PDF
References, table 1.76 on page 400.
Return type tuple
vertices
A list containing a variable number of point (“vertices”) coordinates (each given by a pair of floats) for
various types of annotations:
• ‘Line’ – the starting and ending coordinates (2 float pairs).
• ‘FreeText’ – 2 or 3 float pairs designating the starting, the (optional) knee point, and the ending
coordinates.
• ‘PolyLine’ / ‘Polygon’ – the coordinates of the edges connected by line pieces (n float pairs for n
points).
• text markup annotations – 4 float pairs specifying the QuadPoints of the marked text span (see Adobe
PDF References, page 403).
• ‘Ink’ – list of one to many sublists of vertex coordinates. Each such sublist represents a separate line
in the drawing.
colors
dictionary of two lists of floats in range 0 <= float <= 1 specifying the “stroke” and the interior (“fill”)
colors. The stroke color is used for borders and everything that is actively painted or written (“stroked”).
The fill color is used for the interior of objects like line ends, circles and squares. The lengths of these lists
implicitly determine the colorspaces used: 1 = GRAY, 3 = RGB, 4 = CMYK. So “[1.0, 0.0, 0.0]” stands
for RGB color red. Both lists can be empty if no color is specified.
Return type dict
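The rule that the length of a color list determines its colorspace can be written down as a small check. This is a plain-Python sketch, not library code: colorspace_of is a hypothetical helper, and the sample dictionary only mimics the shape of the colors attribute as described above.

```python
COLORSPACE_BY_LENGTH = {1: "GRAY", 3: "RGB", 4: "CMYK"}


def colorspace_of(color):
    """Return the colorspace implied by a color sequence, or None if it is empty."""
    if not color:
        return None  # no color specified
    if len(color) not in COLORSPACE_BY_LENGTH:
        raise ValueError("color must have 1, 3 or 4 components")
    if any(not 0 <= c <= 1 for c in color):
        raise ValueError("color components must be in range 0..1")
    return COLORSPACE_BY_LENGTH[len(color)]


# hypothetical value shaped like the 'colors' dictionary described above
colors = {"stroke": [1.0, 0.0, 0.0], "fill": []}
for key, value in colors.items():
    print(key, colorspace_of(value))
```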
xref
The PDF xref.
Return type int
irt_xref
The PDF xref of an annotation to which this one responds. Return zero if this is no response annotation.
Return type int
popup_xref
The PDF xref of the associated Popup annotation. Zero if non-existent.
Return type int
has_popup
Whether the annotation has a Popup annotation.
Return type bool
is_open
Whether the annotation’s Popup is open – or the annotation itself (‘Text’ annotations only).
Return type bool
popup_rect
The rectangle of the associated Popup annotation. Infinite rectangle if non-existent.
Return type Rect
border
A dictionary containing border characteristics. Empty if no border information exists. The following keys
may be present:
• width – a float indicating the border thickness in points. The value is -1.0 if no width is specified.
• dashes – a sequence of integers specifying a line dash pattern. [] means no dashes, [n] means equal
on-off lengths of n points, longer lists will be interpreted as specifying alternating on-off length values.
See the Adobe PDF References page 126 for more details.
• style – 1-byte border style: “S” (Solid) = solid rectangle surrounding the annotation, “D” (Dashed)
= dashed rectangle surrounding the annotation, the dash pattern is specified by the dashes entry, “B”
(Beveled) = a simulated embossed rectangle that appears to be raised above the surface of the page,
“I” (Inset) = a simulated engraved rectangle that appears to be recessed below the surface of the page,
“U” (Underline) = a single line along the bottom of the annotation rectangle.
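To make the meaning of the dashes entry concrete, the sketch below expands a dash list into alternating "on" and "off" runs along a line of a given length. It is a plain-Python model that assumes the pattern simply repeats while alternating between on and off. The function dash_segments is hypothetical and not part of PyMuPDF.

```python
from itertools import cycle


def dash_segments(dashes, total):
    """Split a line of length `total` into (is_on, length) runs."""
    if not dashes:
        return [(True, total)]  # [] means a solid line
    if any(d <= 0 for d in dashes):
        raise ValueError("dash lengths must be positive in this sketch")
    segments = []
    remaining = total
    is_on = True
    for length in cycle(dashes):
        if remaining <= 0:
            break
        run = min(length, remaining)
        segments.append((is_on, run))
        remaining -= run
        is_on = not is_on  # alternate between painted and gap
    return segments


print(dash_segments([], 10))      # [(True, 10)]
print(dash_segments([3], 10))     # [(True, 3), (False, 3), (True, 3), (False, 1)]
print(dash_segments([4, 2], 10))  # [(True, 4), (False, 2), (True, 4)]
```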
This is a list of icons referenceable by name for annotation types ‘Text’ and ‘FileAttachment’. You can use them
via the icon parameter when adding an annotation, or use them as the argument in Annot.set_name(). It is left to
your discretion which item to choose when: no mechanism will keep you from using e.g. the “Speaker” icon for a
‘FileAttachment’.
6.1.2 Example
Change the graphical image of an annotation. Also update the “author” and the text to be shown in the popup window:
import fitz
doc = fitz.open("circle-in.pdf")
page = doc[0] # page 0
annot = page.first_annot # get the annotation
annot.set_border(dashes=[3]) # set dashes to "3 on, 3 off ..."
This is what the circle annotation looks like before and after the change (pop-up windows displayed using Nitro PDF
viewer):
6.2 Colorspace
__init__(self, n)
Constructor
Parameters n (int) – A number identifying the colorspace. Possible values are CS_RGB,
CS_GRAY and CS_CMYK.
name
The name identifying the colorspace. Example: fitz.csCMYK.name == ‘DeviceCMYK’.
Type str
n
The number of bytes required to define the color of one pixel. Example: fitz.csCMYK.n == 4.
Type int
Predefined Colorspaces
For saving some typing effort, there exist predefined colorspace objects for the three available cases.
• csRGB = fitz.Colorspace(fitz.CS_RGB)
• csGRAY = fitz.Colorspace(fitz.CS_GRAY)
• csCMYK = fitz.Colorspace(fitz.CS_CMYK)
6.3 DisplayList
DisplayList is a list containing drawing commands (text, images, etc.). The intent is two-fold:
1. as a caching-mechanism to reduce parsing of a page
2. as a data structure in multi-threading setups, where one thread parses the page and another one renders pages.
This aspect is currently not supported by PyMuPDF.
A display list is populated with objects from a page, usually by executing Page.get_displaylist(). There
also exists an independent constructor.
“Replay” the list (once or many times) by invoking one of its methods run(), get_pixmap() or
get_textpage().
Class API
class DisplayList
__init__(self, mediabox)
Create a new display list.
Parameters mediabox (Rect) – The page’s rectangle.
Return type DisplayList
run(device, matrix, area)
Run the display list through a device. The device will populate the display list with its “commands” (i.e.
text extraction or image creation). The display list can later be used to “read” a page many times without
having to re-interpret it from the document file.
You will most probably instead use one of the specialized run methods below – get_pixmap() or
get_textpage().
Parameters
• device (Device) – Device
• matrix (Matrix) – Transformation matrix to apply to the display list contents.
• area (Rect) – Only the part visible within this area will be considered when the list is run
through the device.
get_pixmap(matrix=fitz.Identity, colorspace=fitz.csRGB, alpha=0, clip=None)
Run the display list through a draw device and return a pixmap.
Parameters
• matrix (Matrix) – matrix to use. Default is the identity matrix.
• colorspace (Colorspace) – the desired colorspace. Default is RGB.
• alpha (int) – determine whether or not (0, default) to include a transparency channel.
• clip (irect_like) – restrict rendering to the intersection of this area with
DisplayList.rect.
6.4 Document
This class represents a document. It can be constructed from a file or from memory.
There exists the alias open for this class, i.e. fitz.Document(...) and fitz.open(...) do exactly the same
thing.
For details on embedded files refer to Appendix 3.
Note: Starting with v1.17.0, a new page addressing mechanism is supported for EPUB files only. This document
type is internally organized in chapters such that pages can most efficiently be found by their so-called “location”.
The location is a tuple (chapter, pno) consisting of the chapter number and the page number in that chapter. Both
numbers are zero-based.
While it is still possible to locate a page via its (absolute) number, doing so may mean that the complete EPUB
document must be laid out before the page can be addressed. This may have a significant performance impact if the
document is very large. Using the page’s (chapter, pno) prevents this from happening.
To maintain a consistent API, PyMuPDF supports the page location syntax for all file types – documents without this
feature simply have just one chapter. Document.load_page() and the equivalent index access now also support
a location argument.
There are a number of methods for converting between page numbers and locations, for determining the chapter count,
the page count per chapter, for computing the next and the previous locations, and the last page location of a document.
Class API
class Document
__init__(self, filename=None, stream=None, filetype=None, rect=None, width=0,
height=0, fontsize=11)
Creates a Document object.
• With default parameters, a new empty PDF document will be created.
• If stream is given, then the document is created from memory and either filename or filetype
must indicate its type.
• If stream is None, then a document is created from the file given by filename. Its type is
inferred from the extension, which can be overruled by specifying filetype.
Parameters
• filename (str,pathlib) – A UTF-8 string or pathlib object containing a file
path (or a file type, see below).
• stream (bytes,bytearray,BytesIO) – A memory area containing a supported
document. Its type must be specified by either filename or filetype.
(Changed in version 1.14.13) io.BytesIO is now also supported.
• filetype (str) – A string specifying the type of document. This may be something
looking like a filename (e.g. “x.pdf”), in which case MuPDF uses the extension
to determine the type, or a mime type like application/pdf. Just using strings
like “pdf” will also work.
• rect (rect_like) – a rectangle specifying the desired page size. This parameter
is only meaningful for documents with a variable page layout (“reflowable”
documents), like e-books or HTML, and ignored otherwise. If specified, it must
be a non-empty, finite rectangle with top-left coordinates (0, 0). Together with
parameter fontsize, it determines how each page is laid out and hence also
the number of pages.
• width (float) – may be used together with height as an alternative to rect to
specify layout information.
• height (float) – may be used together with width as an alternative to rect to
specify layout information.
• fontsize (float) – the default fontsize for reflowable document types. This
parameter is ignored if neither rect nor width and height are specified.
It will be used to calculate the page layout.
Overview of possible forms (open is a synonym of Document):
>>> # from a file
>>> doc = fitz.open("some.pdf")
>>> doc = fitz.open("some.file", None, "pdf")  # copes with wrong extension
>>>
>>> # from memory
>>> doc = fitz.open("pdf", mem_area)
>>> doc = fitz.open(None, mem_area, "pdf")
>>> doc = fitz.open(stream=mem_area, filetype="pdf")
The Document class can also be used as a context manager. On exit, the document will
automatically be closed.
>>> import fitz
>>> with fitz.open(...) as doc:
...     for page in doc: print("page %i" % page.number)
...
page 0
page 1
page 2
page 3
>>> doc.is_closed
True
>>>
get_oc(xref )
(New in v1.18.4)
Return the cross reference number of an OCG or OCMD attached to an image or form xobject.
Parameters xref (int) – the xref of an image or form xobject. Valid such
cross reference numbers are returned by Document.get_page_images(),
resp. Document.get_page_xobjects(). For invalid numbers, an exception
is raised.
Return type int
Returns the cross reference number of an optional contents object or zero if there is
none.
set_oc(xref, ocxref )
(New in v1.18.4)
If xref represents an image or form xobject, set or remove the cross reference number ocxref of
an optional contents object.
Parameters
• xref (int) – the xref of an image or form xobject (see footnote 5). Valid such cross
reference numbers are returned by Document.get_page_images(), resp.
Document.get_page_xobjects(). For invalid numbers, an exception is
raised.
• ocxref (int) – the xref number of an OCG / OCMD. If not zero, an invalid
reference raises an exception. If zero, any OC reference is removed.
get_layers()
(New in v1.18.3)
Show optional layer configurations. There always is a standard one, which is not included in the
response.
>>> for item in doc.get_layers(): print(item)
{'number': 0, 'name': 'my-config', 'creator': ''}
>>> # use 'number' as config identifier in add_ocg
Add an optional content configuration. Layers serve as a collection of ON / OFF states for
optional content groups and allow fast visibility switches between different views on the same
document.
Parameters
• name (str) – arbitrary name.
• creator (str) – (optional) creating software.
• on (sequ) – a sequence of OCG xref numbers which should be set to ON when
this layer gets activated. All OCGs not listed here will be set to OFF.
switch_layer(number, as_default=False)
(New in v1.18.3)
Switch to a document view as defined by the optional layer’s configuration number. This is
temporary, except if established as default.
Parameters
• number (int) – config number as returned by Document.get_layers().
• as_default (bool) – make this the default configuration.
Activates the ON / OFF states of OCGs as defined in the identified layer. If as_default=True,
then additionally all layers, including the standard one, are merged and the result is written back
to the standard layer, and all optional layers are deleted.
add_ocg(name, config=-1, on=True, intent="View", usage="Artwork")
(New in v1.18.3)
Add an optional content group. An OCG is the most important unit of information to determine
object visibility. For a PDF, in order to be regarded as having optional content, at least one OCG
must exist.
Parameters
• name (str) – arbitrary name. Will show up in supporting PDF viewers.
• config (int) – layer configuration number. Default -1 is the standard
configuration.
• on (bool) – standard visibility status for objects pointing to this OCG.
• intent (str,list) – a string or list of strings declaring the visibility intents.
There are two PDF standard values to choose from: “View” and “Design”. Default
is “View”. Correct spelling is important.
• usage (str) – another influencer for OCG visibility. This will become part of
the OCG’s /Usage key. There are two PDF standard values to choose from:
“Artwork” and “Technical”. Default is “Artwork”. Please only change when required.
Returns xref of the created OCG. Use as entry for oc parameter in supporting
objects.
Note: Multiple OCGs with identical parameters may be created. This will not cause problems.
Garbage option 3 of Document.save() will get rid of any duplicates.
Note: Like an OCG, an OCMD has a visibility state ON or OFF, and it can be used like an
OCG. In contrast to an OCG, the OCMD state is determined by evaluating the state of one or
more OCGs via special forms of boolean expressions. If the expression evaluates to true, the
OCMD state is ON, otherwise it is OFF.
There are two ways to formulate OCMD visibility:
1. Use the combination of ocgs and policy: The policy value is interpreted as follows:
• AnyOn – (default) true if at least one OCG is ON.
• AnyOff – true if at least one OCG is OFF.
• AllOn – true if all OCGs are ON.
• AllOff – true if all OCGs are OFF.
Suppose you want two PDF objects to be displayed exactly one at a time (if one is ON,
then the other one must be OFF):
Solution: use an OCG for object 1 and an OCMD for object 2. Create the OCMD via
set_ocmd(ocgs=[xref], policy="AllOff"), with the xref of the OCG.
2. Use the visibility expression ve: This is a list of two or more items. The first item is a
logical keyword: one of the strings “and”, “or”, or “not”. The second and all subsequent
items must either be an integer or another list. An integer must be the xref number of an
OCG. A list must again have at least two items starting with one of the boolean keywords.
This syntax is a bit awkward, but quite powerful:
• Each list must start with a logical keyword.
• If the keyword is a “not”, then the list must have exactly two items. If it is “and” or
“or”, any number of other items may follow.
• Items following the logical keyword may be either integers or again a list. An integer
must be the xref of an OCG. A list must conform to the previous rules.
Examples:
• set_ocmd(ve=["or", 4, ["not", 5], ["and", 6, 7]]). This delivers
ON if the following is true: “4 is ON, or 5 is OFF, or 6 and 7 are both ON”.
• set_ocmd(ve=["not", xref]). This has the same effect as the OCMD example
created under 1.
For more details and examples see page 224 of Adobe PDF References. Also do have a
look at example scripts here.
Visibility expressions, /VE, are part of PDF specification version 1.6. So not all PDF
viewers / readers may already support this feature and hence will react in some standard
way for those cases.
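Both ways of formulating OCMD visibility can be modelled in plain Python, which may help when checking an expression before writing it to a PDF. The sketch below is not PyMuPDF code. The functions eval_policy and eval_ve and the states dictionary (OCG xref mapped to True for ON) are hypothetical and exist only for illustration.

```python
def eval_policy(ocgs, policy, states):
    """Evaluate an (ocgs, policy) pair against ON/OFF states keyed by xref."""
    flags = [states[xref] for xref in ocgs]
    if policy == "AnyOn":
        return any(flags)
    if policy == "AnyOff":
        return not all(flags)
    if policy == "AllOn":
        return all(flags)
    if policy == "AllOff":
        return not any(flags)
    raise ValueError("unknown policy: %r" % (policy,))


def eval_ve(ve, states):
    """Evaluate a visibility expression: an xref or [keyword, item, ...]."""
    if isinstance(ve, int):
        return states[ve]  # an integer is the xref of an OCG
    if not isinstance(ve, (list, tuple)) or len(ve) < 2:
        raise ValueError("a list needs a keyword and at least one item")
    keyword, items = ve[0], ve[1:]
    values = [eval_ve(item, states) for item in items]
    if keyword == "not":
        if len(values) != 1:
            raise ValueError("'not' takes exactly one item")
        return not values[0]
    if keyword == "and":
        return all(values)
    if keyword == "or":
        return any(values)
    raise ValueError("unknown keyword: %r" % (keyword,))


states = {4: False, 5: True, 6: True, 7: True}  # hypothetical OCG states
print(eval_ve(["or", 4, ["not", 5], ["and", 6, 7]], states))  # True: 6 and 7 are ON
```

In this model, ["not", xref] and the pair (ocgs=[xref], policy="AllOff") give the same result for either state of the OCG, matching the statement that both formulations have the same effect.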
get_ocmd(xref )
(New in v1.18.4)
Retrieve the definition of an OCMD.
Parameters xref (int) – the xref of the OCMD.
Return type dict
Returns a dictionary with the keys xref, ocgs, policy and ve.
get_layer(config=-1)
(New in v1.18.3)
List of optional content groups by status in the specified configuration. This is a dictionary with
lists of cross reference numbers for OCGs that occur in the arrays /ON, /OFF or in some radio
button group (/RBGroups).
Parameters config (int) – the configuration layer (default is the standard config
layer).
>>> pprint(doc.get_layer())
{'off': [8, 9, 10], 'on': [5, 6, 7], 'rbgroups': [[7, 10]]}
>>>
get_ocgs()
(New in v1.18.3)
Details of all optional content groups. This is a dictionary of dictionaries like this (key is the
OCG’s xref):
>>> pprint(doc.get_ocgs())
{13: {'on': True,
'intent': ['View', 'Design'],
'name': 'Circle',
'usage': 'Artwork'},
14: {'on': True,
'intent': ['View', 'Design'],
'name': 'Square',
'usage': 'Artwork'},
15: {'on': False, 'intent': ['View'], 'name': 'Square', 'usage': 'Artwork'}}
>>>
layer_ui_configs()
(New in v1.18.3)
Show the visibility status of optional content that is modifiable by the user interface of supporting
PDF viewers. Example:
>>> pprint(doc.layer_ui_configs())
({'depth': 0,
'locked': False,
'number': 0,
'on': True,
'text': 'Circle',
'type': 'checkbox'},
{'depth': 0,
'locked': False,
'number': 1,
'on': False,
'text': 'Square',
'type': 'checkbox'})
>>> # refers to OCGs named "Circle" (ON), resp. "Square" (OFF)
Note:
• Only reports items contained in the currently selected layer configuration.
• The meaning of the dictionary keys is as follows:
– depth: item’s nesting level in the /Order array
– locked: whether changing the item’s state is prohibited
– number: running sequence number
– on: item state
– text: text string or name field of the originating OCG
– type: one of “label” (set by a text string), “checkbox” (set by a single OCG) or
“radiobox” (set by a set of connected OCGs)
set_layer_ui_config(number, action=0)
(New in v1.18.3)
Modify OC visibility status of content groups. This is analogous to what supporting PDF viewers
would offer.
Note: Visibility is not a property stored with the OCG. It is not even necessarily
present in the PDF document at all. Instead, the current visibility is temporarily set using the
user interface of some supporting PDF consumer software. The same type of functionality is
offered by this method.
To make permanent changes, use Document.set_layer().
Parameters
• number (int) – number as returned by Document.layer_ui_configs().
• action (int) – 0 = set on (default), 1 = toggle on/off, 2 = set off.
Example:
authenticate(password)
Decrypts the document with the string password. If successful, document data can be accessed.
For PDF documents, the “owner” and the “user” have different privileges, and hence different
passwords may exist for these authorization levels. The method will automatically establish the
appropriate (owner or user) access rights for the provided password.
Parameters password (str) – owner or user password.
Return type int
Returns
a positive value if successful, zero otherwise (the string does not match either
password). If positive, the indicator Document.is_encrypted is set to
False. Positive return codes carry the following information detail:
• 1 => authenticated, but the PDF has neither owner nor user passwords.
• 2 => authenticated with the user password.
• 4 => authenticated with the owner password.
• 6 => authenticated and both passwords are equal – probably a rare situation.
Note: The document may be protected by an owner password, but not by a user
password. Detect this situation via doc.authenticate("") == 2. This allows opening
and reading the document without authentication, but, depending on the
Document.permissions value, other actions may be prohibited. PyMuPDF
(like MuPDF) in this case ignores those restrictions. So, in contrast to
PDF viewers, you can for example extract text and add or modify content, even
if the respective permission flags PDF_PERM_COPY, PDF_PERM_MODIFY,
PDF_PERM_ANNOTATE, etc. are set off! It is your responsibility to build a
legally compliant application where applicable.
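The return codes listed above lend themselves to a small lookup helper. The following plain-Python sketch only interprets an integer that has already been obtained from authenticate(). The names AUTH_MEANING and describe_auth are hypothetical and not part of PyMuPDF.

```python
AUTH_MEANING = {
    0: "authentication failed",
    1: "authenticated, PDF has neither owner nor user password",
    2: "authenticated with the user password",
    4: "authenticated with the owner password",
    6: "authenticated, user and owner passwords are equal",
}


def describe_auth(rc):
    """Translate a documented authenticate() return code into text."""
    if rc not in AUTH_MEANING:
        raise ValueError("undocumented return code: %r" % (rc,))
    return AUTH_MEANING[rc]


# rc = doc.authenticate("secret")  # assumed: doc is an open, encrypted Document
for rc in (0, 2, 6):
    print(rc, "=>", describe_auth(rc))
```

Note that 6 equals 2 | 4, so testing rc & 2 or rc & 4 is one way to check for user or owner authentication in a single step.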
get_page_numbers(label, only_one=False)
(New in v 1.18.6)
PDF only: Return a list of page numbers that have the specified label – note that labels may
not be unique in a PDF. This implies a sequential search through all page numbers to compare
their labels.
Note: Implementation detail – pages are not loaded for this purpose.
Parameters
• label (str) – the label to look for, e.g. “vii” (Roman number 7).
• only_one (bool) – stop after first hit. Useful e.g. if labelling is known to
be unique, or there are many pages, etc. The default will check every page
number.
Return type list
Returns list of page numbers that have this label. Empty if none found, no labels
defined, etc.
get_page_labels()
(New in v1.18.7)
PDF only: Extract the list of page label definitions. Typically used for modifications before
feeding it into Document.set_page_labels().
Returns a list of dictionaries as defined in Document.set_page_labels().
set_page_labels(labels)
(New in v1.18.6)
PDF only: Add or update the page label definitions of the PDF.
Parameters labels (list) – a list of dictionaries. Each dictionary defines a
label building rule and a 0-based “start” page number. That start page is the
first for which the label definition is valid. Each dictionary has up to 4 items
and looks like {'startpage': int, 'prefix': str, 'style':
str, 'firstpagenum': int} and has the following items.
• startpage: (int) the first page number (0-based) to apply the label rule. This
key must be present. The rule is applied to all subsequent pages until either
end of document or superseded by the rule with the next larger page number.
• prefix: (str) an arbitrary string to start the label with, e.g. “A-”. Default is
“”.
• style: (str) the numbering style. Available are “D” (decimal), “r”/“R” (Roman
numbers, lower / upper case), and “a”/“A” (lower / upper case alphabetical
numbering: “a” through “z”, then “aa” through “az”, etc.). Default is “”. If “”,
no numbering will take place and the pages in that range will receive the same
label consisting of the prefix value. If prefix is also omitted, then the label
will be “”.
• firstpagenum: (int) start numbering with this value. Default is 1, smaller
values are ignored.
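For example, a rule list like [{'startpage': 6, 'prefix': 'A-', 'style': 'D', 'firstpagenum': 10},
{'startpage': 10, 'prefix': '', 'style': 'D', 'firstpagenum': 1}]
will generate the labels “A-10”, “A-11”, “A-12”, “A-13”, “1”, “2”, “3”, . . . for pages 6, 7 and
so on until end of document. Pages 0 through 5 will have the label “”.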
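How a list of label rules turns into per-page labels can be modelled in plain Python. The sketch below is not PyMuPDF's implementation: page_label and to_roman are hypothetical helpers, the alphabetical styles are deliberately left out, and the rule list is an assumption chosen so that it reproduces the labels quoted above.

```python
def to_roman(n):
    """Upper-case Roman numeral for a positive integer."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    parts = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def page_label(pno, rules):
    """Label of 0-based page pno under the given list of rule dictionaries."""
    applicable = [r for r in rules if r["startpage"] <= pno]
    if not applicable:
        return ""  # no rule covers this page
    rule = max(applicable, key=lambda r: r["startpage"])
    prefix = rule.get("prefix", "")
    style = rule.get("style", "")
    if style == "":
        return prefix  # no numbering: every page gets just the prefix
    first = max(rule.get("firstpagenum", 1), 1)  # values below 1 are ignored
    number = first + pno - rule["startpage"]
    if style == "D":
        return prefix + str(number)
    if style == "R":
        return prefix + to_roman(number)
    if style == "r":
        return prefix + to_roman(number).lower()
    raise NotImplementedError("styles 'a'/'A' are not covered by this sketch")


rules = [
    {"startpage": 6, "prefix": "A-", "style": "D", "firstpagenum": 10},
    {"startpage": 10, "prefix": "", "style": "D", "firstpagenum": 1},
]
labels = [page_label(pno, rules) for pno in range(13)]
print(labels)
```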
make_bookmark(loc)
(New in v.1.17.3) Return a page pointer in a reflowable document. After the document has been
laid out again, the result of this method can be used to find the new location of the page.
find_bookmark(bookmark)
(New in v.1.17.3) Return the new page location after the document has been laid out again.
Parameters bookmark (pointer) – created by Document.
make_bookmark().
Return type tuple
Returns the new (chapter, pno) of the page.
chapter_page_count(chapter)
(New in v.1.17.0) Return the number of pages of a chapter.
Parameters chapter (int) – the 0-based chapter number.
Return type int
Returns number of pages in chapter. Relevant only for document types with chapter
support (EPUB currently).
next_location(page_id)
(New in v.1.17.0) Return the location of the following page.
Parameters page_id (tuple) – the current page id. This must be a tuple (chapter,
pno) identifying an existing page.
Returns The tuple of the following page, i.e. either (chapter, pno + 1) or (chapter +
1, 0), or the empty tuple () if the argument was the last page. Relevant only for
document types with chapter support (EPUB currently).
prev_location(page_id)
(New in v.1.17.0) Return the location of the preceding page.
Parameters page_id (tuple) – the current page id. This must be a tuple (chapter,
pno) identifying an existing page.
Returns The tuple of the preceding page, i.e. either (chapter, pno - 1) or the last
page of the preceding chapter, or the empty tuple () if the argument was the first
page. Relevant only for document types with chapter support (EPUB currently).
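The stepping rules for locations can be expressed in a few lines. This plain-Python sketch models a document as a list of per-chapter page counts and assumes every chapter has at least one page. The functions next_loc and prev_loc and the sample counts are hypothetical, not library code.

```python
def _check(loc, page_counts):
    chapter, pno = loc
    if not 0 <= chapter < len(page_counts):
        raise ValueError("bad chapter number")
    if not 0 <= pno < page_counts[chapter]:
        raise ValueError("bad page number in chapter")


def next_loc(loc, page_counts):
    """Location following loc, or () if loc is the last page."""
    _check(loc, page_counts)
    chapter, pno = loc
    if pno + 1 < page_counts[chapter]:
        return (chapter, pno + 1)
    if chapter + 1 < len(page_counts):
        return (chapter + 1, 0)
    return ()


def prev_loc(loc, page_counts):
    """Location preceding loc, or () if loc is the first page."""
    _check(loc, page_counts)
    chapter, pno = loc
    if pno > 0:
        return (chapter, pno - 1)
    if chapter > 0:
        return (chapter - 1, page_counts[chapter - 1] - 1)
    return ()


page_counts = [3, 2]  # hypothetical: chapter 0 has 3 pages, chapter 1 has 2
print(next_loc((0, 2), page_counts))  # (1, 0)
print(prev_loc((1, 0), page_counts))  # (0, 2)
```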
load_page(page_id=0)
Create a Page object for further processing (like rendering, text searching, etc.).
(Changed in v1.17.0) For document types supporting a so-called “chapter structure” (like
EPUB), pages can also be loaded via the combination of chapter number and relative page
number, instead of the absolute page number. This should significantly speed up access for
large documents.
Parameters page_id (int,tuple) – (Changed in v1.17.0)
Either a 0-based page number, or a tuple (chapter, pno). For an integer, any
-∞ < page_id < page_count is acceptable. While page_id is negative,
page_count will be added to it. For example: to load the last page, you can
use doc.load_page(-1). After this you have page.number = doc.page_count - 1.
For a tuple, chapter must be in range Document.chapter_count, and pno
must be in range Document.chapter_page_count() of that chapter. Both
values are 0-based. Using this notation, Page.number will equal the given
tuple. Relevant only for document types with chapter support (EPUB currently).
Return type Page
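The rule for negative integer page numbers can be illustrated without opening a document. The helper below is a hypothetical model of the rule described above (page_count is added while the number is negative). The exception type is an assumption made for this sketch.

```python
def normalize_page_number(page_id, page_count):
    """Map any integer below page_count to a page number in range(page_count)."""
    if page_id >= page_count:
        raise ValueError("page not in document")  # assumed error type
    while page_id < 0:
        page_id += page_count                      # add page_count while negative
    return page_id
```

With this model, -1 maps to the last page of a 10-page document and -25 maps to page 5.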
Note: Documents also follow the Python sequence protocol with page numbers as indices:
doc.load_page(n) == doc[n].
For absolute page numbers only, expressions like “for page in doc: . . . ” and “for page in
reversed(doc): . . . ” will successively yield the document’s pages. Refer to Document.pages()
which allows processing pages as with slicing.
You can also use index notation with the new chapter-based page identification: use page = doc[(5,
2)] to load the third page of the sixth chapter.
To maintain a consistent API, for document types not supporting a chapter structure (like PDFs),
Document.chapter_count is 1, and pages can also be loaded via tuples (0, pno). See this3
footnote for comments on performance improvements.
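The relation between absolute page numbers and (chapter, pno) tuples can be sketched as follows. Both helpers and the list chapter_sizes are hypothetical and only model the numbering scheme; they assume non-negative page numbers and are not PyMuPDF code.

```python
def to_location(chapter_sizes, pno):
    """Convert an absolute 0-based page number to a (chapter, pno) tuple."""
    for chapter, size in enumerate(chapter_sizes):
        if pno < size:
            return (chapter, pno)
        pno -= size                    # skip all pages of this chapter
    raise ValueError("page number outside document")


def to_page_number(chapter_sizes, location):
    """Convert a (chapter, pno) tuple to an absolute 0-based page number."""
    chapter, pno = location
    return sum(chapter_sizes[:chapter]) + pno


# Hypothetical EPUB with seven chapters.
chapter_sizes = [3, 1, 4, 2, 5, 3, 6]
# A document without chapter structure behaves like one single chapter.
pdf_like = [24]
```

In this model, the third page of the sixth chapter, (5, 2), is the absolute page number 17.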
reload_page(page)
(New in version 1.16.10)
PDF only: Provide a new copy of a page after finishing and updating all pending changes.
Parameters page (Page) – page object.
Return type Page
Returns
a new copy of the same page. All pending updates (e.g. to annotations or widgets)
will be finalized and a fresh copy of the page will be loaded.
Note: In a typical use case, a page Pixmap should be taken after annotations /
widgets have been added or changed. To force all those changes being reflected in
the page structure, this method re-instates a fresh copy while keeping the object
hierarchy “document -> page -> annotations/widgets” intact.
page_cropbox(pno)
(New in version 1.17.7)
PDF only: Return the unrotated page rectangle – without loading the page (via Document.
load_page()). This is meant for internal purposes requiring the best possible performance.
Parameters pno (int) – 0-based page number.
Returns Rect of the page like Page.rect(), but ignoring any rotation.
page_xref(pno)
(New in version 1.17.7)
PDF only: Return the xref of the page – without loading the page (via Document.
load_page()). This is meant for internal purposes requiring the best possible performance.
Parameters pno (int) – 0-based page number.
Returns xref of the page like Page.xref.
3 For applicable (EPUB) document types, loading a page via its absolute number may result in layouting a large part of the document, before
the page can be accessed. To avoid this performance impact, prefer chapter-based access. Use convenience methods and attributes Document.
next_location(), Document.prev_location() and Document.last_location for maintaining a high level of coding efficiency.
pages(start=None[, stop=None[, step=None ]])
(New in version 1.16.4)
A generator for a range of pages. Parameters have the same meaning as in the built-in function
range(). Intended for expressions of the form “for page in doc.pages(start, stop, step): . . . ”.
Parameters
• start (int) – start iteration with this page number. Default is zero, al-
lowed values are -∞ < start < page_count. While this is negative,
page_count is added before starting the iteration.
• stop (int) – stop iteration at this page number. Default is page_count,
possible are -∞ < stop <= page_count. Larger values are silently re-
placed by the default. Negative values will cyclically emit the pages in reversed
order. As with the built-in range(), this is the first page not returned.
• step (int) – stepping value. Defaults are 1 if start < stop and -1 if start >
stop. Zero is not allowed.
Returns
a generator iterator over the document’s pages. Some examples:
• “doc.pages()” emits all pages.
• “doc.pages(4, 9, 2)” emits pages 4, 6, 8.
• “doc.pages(0, None, 2)” emits all pages with even numbers.
• “doc.pages(-2)” emits the last two pages.
• “doc.pages(-1, -1)” emits all pages in reversed order.
• “doc.pages(-1, -10)” always emits 10 pages in reversed order, starting with the
last page – repeatedly if the document has less than 10 pages. So for a 4-page
document the following page numbers are emitted: 3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1,
0, 3.
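The parameter rules can be checked with a small model that works on page numbers only. The function page_numbers is hypothetical and written for this illustration; it reproduces the page numbers listed in the examples above but is not taken from PyMuPDF, and the exception types are assumptions.

```python
def page_numbers(page_count, start=None, stop=None, step=None):
    """Return the page numbers a call like doc.pages(start, stop, step) would visit."""
    start = start or 0
    while start < 0:
        start += page_count            # negative start counts from the end
    if start >= page_count:
        raise ValueError("bad start page number")
    if stop is None or stop > page_count:
        stop = page_count              # larger values fall back to the default
    if step == 0:
        raise ValueError("step must not be zero")
    if step is None:
        step = -1 if start > stop else 1
    # Negative numbers wrap around, like negative page numbers in load_page().
    return [pno % page_count for pno in range(start, stop, step)]
```

For example, page_numbers(4, -1, -10) walks backwards from the last page of a 4-page document and wraps around several times.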
convert_to_pdf(from_page=-1, to_page=-1, rotate=0)
Create a PDF version of the current document and write it to memory. All document types are
supported. The parameters have the same meaning as in insert_pdf(). In essence, you can
restrict the conversion to a page subset, specify page rotation, and reverse the page sequence.
Parameters
• from_page (int) – first page to copy (0-based). Default is first page.
• to_page (int) – last page to copy (0-based). Default is last page.
• rotate (int) – rotation angle. Default is 0 (no rotation). Should be n * 90
with an integer n (not checked).
Return type bytes
Returns
imgpdf=fitz.open("pdf", pdfbytes)
doc.insert_pdf(imgpdf) # insert the image PDF
>>> doc.save("allmyimages.pdf")
Note: The method uses the same logic as the mutool convert CLI. This works very well in
most cases – however, beware of the following limitations.
• Image files: perfect, no issues detected. Apparently however, image transparency is ig-
nored. If you need that (like for a watermark), use Page.insert_image() instead.
Otherwise, this method is recommended for its much better performance.
• XPS: appearance very good. Links work fine, outlines (bookmarks) are lost, but can easily
be recovered2 .
• EPUB, CBZ, FB2: similar to XPS.
• SVG: medium. Roughly comparable to svglib.
2 However, you can use Document.get_toc() and Page.get_links() (which are available for all document types) and copy this
get_toc(simple=True)
Creates a table of contents (TOC) out of the document’s outline chain.
Parameters simple (bool) – Indicates whether a simple or a detailed TOC is required.
If False, each item of the list also contains a dictionary with linkDest
details for each outline entry.
Return type list
Returns
a list of lists. Each entry has the form [lvl, title, page, dest]. Its entries have the
following meanings:
• lvl – hierarchy level (positive int). The first entry is always 1. Entries in a row
are either equal, increase by 1, or decrease by any number.
• title – title (str)
• page – 1-based page number (int). If -1 either no destination or outside docu-
ment.
• dest – (dict) included only if simple=False. Contains details of the TOC item
as follows:
– kind: destination kind, see Link Destination Kinds.
– file: filename if kind is LINK_GOTOR or LINK_LAUNCH.
– page: target page, 0-based, LINK_GOTOR or LINK_GOTO only.
– to: position on target page (Point).
– zoom: (float) zoom factor on target page.
– xref: xref of the item (0 if no PDF).
– color: item color in PDF RGB format (red, green, blue), or omitted
(always omitted if no PDF).
– bold: true if bold item text or omitted. PDF only.
– italic: true if italic item text, or omitted. PDF only.
– collapse: true if sub-items are folded, or omitted. PDF only.
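The rule for hierarchy levels (the first entry is 1, and each following level is at most one larger than its predecessor) can be verified before a list is used as a table of contents. The checker below is a hypothetical helper written for this illustration, and the sample entries are made up.

```python
def first_bad_level(toc):
    """Return the index of the first entry with an invalid level, or -1 if all are valid."""
    previous = 0
    for index, item in enumerate(toc):
        level = item[0]
        # A level may stay equal, grow by exactly 1, or drop by any amount.
        if level < 1 or level > previous + 1:
            return index
        previous = level
    return -1


good_toc = [[1, "Intro", 1], [2, "Scope", 2], [3, "Details", 3], [1, "Usage", 5]]
bad_toc = [[1, "Intro", 1], [3, "Details", 3]]   # jumps from level 1 to 3
```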
xref_get_keys(xref )
(New in v1.18.7)
PDF only: Return the PDF dictionary keys of the object provided by its xref number.
Parameters xref (int) – the xref. (Changed in v1.18.10) Use -1 to access the
special dictionary “PDF trailer”.
Returns
a tuple of dictionary keys present in object xref. Examples:
>>>
xref_get_key(xref, key)
(New in v1.18.7)
PDF only: Return type and value of a PDF dictionary key of an xref.
Parameters
• xref (int) – the xref. Changed in v1.18.10: Use -1 to access the special
dictionary “PDF trailer”.
• key (str) – the desired PDF key. Must exactly match (case-sensitive) one of
the keys contained in Document.xref_get_keys().
Returns
a tuple (type, value) of strings, where type is one of “xref”, “array”, “dict”, “int”,
“float”, “null”, “bool”, “name”, “string” or “unknown” (should not occur). Inde-
pendent of “type”, the value of the key is always formatted as a string – see the
following example – and (almost always) a faithful reflection of what is stored in
the PDF. In most cases, the format of the value string also gives a clue about the
key type:
• A “name” always starts with a “/” slash.
• An “xref” always ends with ” 0 R”.
• An “array” is always enclosed in “[. . . ]” brackets.
• A “dict” is always enclosed in “<<. . . >>” brackets.
• A “bool” always equals either “true” or “false”, and a “null” always equals “null”.
• “float” and “int” are represented by their string format – and are thus not always
distinguishable.
• A “string” is converted to UTF-8 and may therefore deviate from what is
stored in the PDF. For example, the PDF key “Author” may have a value of
“<FEFF004A006F0072006A00200058002E0020004D0063004B00690065>”
in the file, but the method will return ('string', 'Jorj X. McKie').
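The format clues listed above can be turned into a small heuristic, and the “Author” example can be reproduced with the standard library. Both functions are hypothetical helpers for illustration only. guess_type merely guesses from the string format and can be fooled (for instance by a string whose text is “true”), and the Latin-1 fallback in decode_hex_string is an assumption that only approximates PDF's own default text encoding.

```python
def guess_type(value):
    """Guess the PDF object type from the format of a value string."""
    if value.startswith("/"):
        return "name"
    if value.endswith(" 0 R"):
        return "xref"
    if value.startswith("<<") and value.endswith(">>"):
        return "dict"
    if value.startswith("[") and value.endswith("]"):
        return "array"
    if value in ("true", "false"):
        return "bool"
    if value == "null":
        return "null"
    try:
        float(value)
    except ValueError:
        return "string"
    return "number"  # "int" and "float" are not always distinguishable


def decode_hex_string(pdf_hex):
    """Decode a hex-encoded PDF string such as <FEFF004A...>."""
    raw = bytes.fromhex(pdf_hex.strip("<>"))
    if raw.startswith(b"\xfe\xff"):
        return raw.decode("utf-16")    # the byte order mark selects big-endian
    return raw.decode("latin-1")       # assumption: approximate fallback


author = decode_hex_string(
    "<FEFF004A006F0072006A00200058002E0020004D0063004B00690065>"
)
```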
xref_set_key(xref, key, value)
Caution: This is an expert function: if you do not know what you are doing, there is a high
risk of rendering (parts of) the PDF unusable. Please do consult Adobe PDF References about
object specification formats (page 18) and the structure of special dictionary types like page
objects.
Parameters
• xref (int) – the xref. Changed in v1.18.13: To update the PDF trailer,
specify -1.
• key (str) – the desired PDF key (without leading “/”). Must not be empty.
Any valid PDF key – whether already present in the object (which will be over-
written) – or new. It is possible to use PDF path notation like "Resources/
ExtGState" – which sets the value for key "/ExtGState" as a sub-object
of "/Resources".
• value (str) – the value for the key. It must be a non-empty string and, de-
pending on the desired PDF object type, the following rules must be observed.
There is some syntax checking, but no type checking and no checking if it
makes sense PDF-wise, i.e. no semantics checking. Upper or lower case are
important!
– xref – must be provided as "nnn 0 R" with a valid xref number nnn of
the PDF. The suffix “0 R” is required to be recognizable as an xref by PDF
applications.
– array – a string like "[a b c d e f]". The brackets are required.
Array items must be separated by at least one space (not commas like in
Python). An empty array "[]" is possible and equivalent to removing the
key. Array items may be any PDF objects, like dictionaries, xrefs, other
arrays, etc. Like in Python, array items may be of different types.
– dict – a string like "<< ... >>". The brackets are required and must
enclose a valid PDF dictionary definition. The empty dictionary "<<>>" is
possible and equivalent to removing the key.
– int – an integer formatted as a string.
– float – a float formatted as a string. Scientific notation (with exponents) is
not allowed by PDF.
– null – the string "null". This is the PDF equivalent to Python’s None
and causes the key to be ignored – however not necessarily removed, resp.
removed on saves with garbage collection.
– bool – one of the strings "true" or "false".
– name – a valid PDF name with a leading slash: "/PageLayout". See
page 16 of the Adobe PDF References.
– string – a valid PDF string. All PDF strings must be enclosed by brackets.
Denote the empty string as "()". Depending on its content, the possible
brackets are
* ”<. . . >” for hex-encoded text. Every character must be represented by two
hex-digits (lower or upper case).
– If in doubt, we strongly recommend to use get_pdf_str()! This func-
tion automatically generates the right brackets, escapes, and overall format.
E.g. it will do conversions like these:
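A few of the value formats described above can be produced with plain string formatting. The three helpers below are hypothetical and not part of PyMuPDF; they only build strings that follow the stated rules (xref suffix “0 R”, space-separated array items, two hex digits per byte). For text strings, the recommendation to use get_pdf_str() still applies.

```python
def xref_value(xref):
    """Format an xref number as an indirect reference."""
    return f"{xref} 0 R"


def array_value(items):
    """Format items as a PDF array: brackets required, items separated by spaces."""
    return "[" + " ".join(str(item) for item in items) + "]"


def hex_string_value(text):
    """Format text as a hex-encoded PDF string (UTF-16BE with byte order mark)."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"
```

Note that hex_string_value("Jorj X. McKie") yields the same hex string as quoted for the “Author” key in the description of xref_get_key().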
Note: In general, this is not the list of images that are actually displayed. This method
only parses several PDF objects to collect references to embedded images. It does not analyse
the page’s contents, where all the actual image display commands are defined. To get this
information, please use Page.get_image_info(). Also have a look at the discussion in
section Structure of Dictionary Outputs.
get_page_fonts(pno, full=False)
PDF only: Return a list of all fonts (directly or indirectly) referenced by the page.
Parameters
• pno (int) – page number, 0-based, -∞ < pno < page_count.
• full (bool) – whether to also include the referencer’s xref. If True,
the returned items are one entry longer. Use this option if you need to
know, whether the page directly references the font. In this case the last
entry is 0. If the font is referenced by an /XObject of the page, you will
find its xref here.
Return type list
Returns a list of fonts referenced by this page. Each entry looks like
(xref, ext, type, basefont, name, encoding, referencer),
where
• xref (int) is the font object number (may be zero if the PDF uses one of the builtin fonts
directly)
• ext (str) font file extension (e.g. “ttf”, see Font File Extensions)
• type (str) is the font type (like “Type1” or “TrueType” etc.)
• basefont (str) is the base font name,
• name (str) is the symbolic name, by which the font is referenced
• encoding (str) the font’s character encoding if different from its built-in encoding (Adobe
PDF References, p. 254):
• referencer (int optional) the xref of the referencer. Zero if directly referenced by the
page, otherwise the xref of an XObject. Only present if full=True.
Example:
Note:
• This list has no duplicate entries: the combination of xref, name and referencer is
unique.
• In general, this is a superset of the fonts actually in use by this page. The PDF creator
may e.g. have specified some global list, of which each page only makes partial use.
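The following sketch shows how a result obtained with full=True might be evaluated. The list fonts is made-up sample data in the documented tuple layout, not output from a real file.

```python
# Hypothetical result of doc.get_page_fonts(pno, full=True):
# (xref, ext, type, basefont, name, encoding, referencer)
fonts = [
    (12, "ttf", "TrueType", "ABCDEF+Arial", "F1", "WinAnsiEncoding", 0),
    (15, "n/a", "Type1", "Helvetica", "F2", "", 0),
    (12, "ttf", "TrueType", "ABCDEF+Arial", "F7", "WinAnsiEncoding", 48),
]

# Fonts referenced directly by the page have referencer 0.
direct = [font for font in fonts if font[6] == 0]

# Group the remaining base font names by the xref of the referencing XObject.
via_xobject = {}
for font in fonts:
    if font[6] != 0:
        via_xobject.setdefault(font[6], []).append(font[3])

# The combination of xref, name and referencer identifies an entry.
keys = {(font[0], font[4], font[6]) for font in fonts}
```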
select(s)
PDF only: Keeps only those pages of the document whose numbers occur in the list. Empty
sequences or elements outside range(doc.page_count) will cause a ValueError. For
more details see remarks at the bottom of this chapter.
Parameters s (sequence) – The sequence (see Using Python Sequences as Ar-
guments in PyMuPDF) of page numbers (zero-based) to be included. Pages
not in the sequence will be deleted (from memory) and become unavailable
until the document is reopened. Page numbers can occur multiple times
and in any order: the resulting document will reflect the sequence exactly as
specified.
Note:
• Page numbers in the sequence need not be unique nor be in any particular order. This
makes the method a versatile utility to e.g. select only the even or the odd pages or
meeting some other criteria and so forth.
• On a technical level, the method will always create a new pagetree.
• When dealing with only a few pages, methods copy_page(), move_page(),
delete_page() are easier to use. In fact, they are also much faster – by at least
one order of magnitude when the document has many pages.
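Typical page sequences for select() are easy to build with range(). The sketch below also models the documented argument check. The function check_selection is a hypothetical helper, not part of PyMuPDF, and page_count is a made-up value.

```python
def check_selection(pages, page_count):
    """Mimic the documented check: reject empty or out-of-range sequences."""
    pages = list(pages)
    if not pages or any(p not in range(page_count) for p in pages):
        raise ValueError("bad page number(s)")
    return pages


page_count = 6                                   # hypothetical document size
even_pages = check_selection(range(0, page_count, 2), page_count)
odd_pages = check_selection(range(1, page_count, 2), page_count)
reversed_pages = check_selection(range(page_count - 1, -1, -1), page_count)
repeated = check_selection([0, 0, 5, 2], page_count)   # duplicates are allowed
```

Any of these lists could then be passed to doc.select().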
set_metadata(m)
PDF only: Sets or updates the metadata of the document as specified in m, a Python dictionary.
Parameters m (dict) – A dictionary with the same keys as metadata (see below).
All keys are optional. A PDF’s format and encryption method cannot be set or
changed and will be ignored. If any value should not contain data, do not spec-
ify its key or set the value to None. If you use {} all metadata information will
be cleared to the string “none”. If you want to selectively change only some
values, modify a copy of doc.metadata and use it as the argument. Arbitrary
unicode values are possible if specified as UTF-8-encoded.
(Changed in v1.18.4) Empty values or “none” are no longer written, but completely omitted.
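Changing only some values by modifying a copy can look like the sketch below. The dictionary current stands in for doc.metadata and contains made-up values; the final call to doc.set_metadata(updated) is not shown.

```python
# Hypothetical stand-in for doc.metadata
current = {
    "format": "PDF 1.7",
    "title": "Draft",
    "author": "Jane Doe",
    "subject": None,
    "keywords": "draft, internal",
    "encryption": None,
}

updated = dict(current)           # work on a copy, keep the original untouched
updated["title"] = "Final report" # change one value
updated["keywords"] = None        # None means: this entry should not contain data
```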
get_xml_metadata()
PDF only: Get the document XML metadata.
Return type str
Returns XML metadata of the document. Empty string if not present or not a PDF.
set_xml_metadata(xml)
PDF only: Sets or updates XML metadata of the document.
Parameters xml (str) – the new XML metadata. Should be XML syntax, how-
ever no checking is done by this method and any string is accepted.
set_toc(toc, collapse=1)
PDF only: Replaces the complete current outline tree (table of contents) with the one pro-
vided as the argument. After successful execution, the new outline tree can be accessed as
usual via Document.get_toc() or via Document.outline. Like with other output-
oriented methods, changes become permanent only via save() (incremental save supported).
Internally, this method consists of the following two steps. For a demonstration see example
below.
• Step 1 deletes all existing bookmarks.
Parameters
• toc (sequence) – A list / tuple with all bookmark entries that should
form the new table of contents. Output variants of get_toc() are ac-
ceptable. To completely remove the table of contents specify an empty
sequence or None. Each item must be a list with the following format.
– [lvl, title, page [, dest]] where
* lvl is the hierarchy level (int > 0) of the item, which must be 1 for the
first item and at most 1 larger than the previous one.
outline_xref(idx)
(New in v1.17.7)
PDF only: Return the xref of the outline item. This is mainly used for internal purposes.
Parameters idx (int) – index of the item in list Document.get_toc().
Returns xref.
del_toc_item(idx)
• New in v1.17.7
• Changed in v1.18.14: no longer remove the item’s text, but show it grayed-out.
PDF only: Remove this TOC item. This is a high-speed method, which disables the respective
item, but leaves the overall TOC structure intact. Physically, the item still exists in the TOC
tree, but is shown grayed-out and will no longer point to any destination.
This also implies that you can reassign the item to a new destination using Document.
set_toc_item(), when required.
Parameters idx (int) – the index of the item in list Document.get_toc().
set_toc_item(idx, dest_dict=None, kind=None, pno=None, uri=None, title=None,
to=None, filename=None, zoom=0)
• New in v1.17.7
• Changed in v1.18.6
PDF only: Changes the TOC item identified by its index. You can change the item’s title, destination,
appearance (color, bold, italic) or the collapsing of sub-items – or remove the item altogether.
Use this method if you need specific changes for selected entries only and want to avoid replacing
the complete TOC. This is especially beneficial when dealing with a large table of contents.
Parameters
• idx (int) – the index of the entry in the list created by Document.
get_toc().
• dest_dict (dict) – the new destination. A dictionary like the last
entry of an item in doc.get_toc(False). Using this as a template is
recommended. When given, all other parameters are ignored – except
title.
• kind (int) – the link kind, see Link Destination Kinds. If LINK_NONE,
then all remaining parameters will be ignored, and the TOC item will be
removed – same as Document.del_toc_item(). If None, then only
the title is modified and the remaining parameters are ignored. All other
values will lead to making a new destination dictionary using the subse-
quent arguments.
• pno (int) – the 1-based page number, i.e. a value 1 <= pno <=
doc.page_count. Required for LINK_GOTO.
• uri (str) – the URL text. Required for LINK_URI.
• title (str) – the desired new title. None if no change.
• to (point_like) – (optional) points to a coordinate on the target page.
Relevant for LINK_GOTO. If omitted, a point near the page’s top is cho-
sen.
• filename (str) – required for LINK_GOTOR and LINK_LAUNCH.
• zoom (float) – use this zoom factor when showing the target page.
Example use: Change the TOC of the SWIG manual to achieve this:
Collapse everything below top level and show the chapter on Python support in red, bold and
italic:
In the previous example, we have changed only 42 of the 1240 TOC items of the file.
can_save_incrementally()
(New in version 1.16.0)
Check whether the document can be saved incrementally. Use it to choose the right option
without encountering exceptions.
scrub(attached_files=True, clean_pages=True, embedded_files=True, hidden_text=True,
javascript=True, metadata=True, redactions=True, redact_images=0,
remove_links=True, reset_fields=True, reset_responses=True, thumbnails=True,
xml_metadata=True)
PDF only: (New in v1.16.14) Remove potentially sensitive data from the PDF. This function is
inspired by the similar “Sanitize” function in Adobe Acrobat products. The process is configurable
by a number of options. All boolean options are True by default.
Parameters
• attached_files (bool) – Search for ‘FileAttachment’ annotations
and remove the file content.
• clean_pages (bool) – Remove any comments from page painting
sources. If this option is set to False, then this is also done for hidden_text
and redactions.
• embedded_files (bool) – Remove embedded files.
• hidden_text (bool) – Remove OCRed text and invisible text7 .
• javascript (bool) – Remove JavaScript sources.
• metadata (bool) – Remove PDF standard metadata.
• redactions (bool) – Apply redaction annotations.
• redact_images (int) – how to handle images if applying redactions.
One of 0 (ignore), 1 (blank out overlaps) or 2 (remove).
• remove_links (bool) – Remove all links.
• reset_fields (bool) – Reset all form fields to their defaults.
• reset_responses (bool) – Remove all responses from all annota-
tions.
• thumbnails (bool) – Remove thumbnail images from pages.
• xml_metadata (bool) – Remove XML metadata.
save(outfile, garbage=0, clean=False, deflate=False, deflate_images=False,
deflate_fonts=False, incremental=False, ascii=False, expand=0, linear=False,
pretty=False, no_new_id=False, encryption=PDF_ENCRYPT_NONE, permissions=-1,
owner_pw=None, user_pw=None)
• Changed in v1.18.7
• Changed in v1.19.0
PDF only: Saves the document in its current state.
Parameters
7 This only works under certain conditions. For example, if there is normal text covered by some image on top of it, then this is undetectable
and the respective text is not removed. Similar is true for white text on white background, and so on.
PostScript to do this (pp. 643 in Adobe PDF References), which gets interpreted when a page is loaded.
4 These parameters cause separate handling of stream categories: use it together with expand to restrict decompression to streams other than
images / fontfiles.
Note: The method does not check whether a file of that name already exists. It will therefore
not ask for confirmation and will overwrite the file. It is your responsibility as a programmer
to handle this.
ez_save(*args, **kwargs)
(New in v1.18.11)
PDF only: The same as Document.save() but with the changed defaults deflate=True,
garbage=3.
saveIncr()
PDF only: saves the document incrementally. This is a convenience abbreviation for
doc.save(doc.name, incremental=True, encryption=PDF_ENCRYPT_KEEP).
tobytes(garbage=0, clean=False, deflate=False, deflate_images=False,
deflate_fonts=False, ascii=False, expand=0, linear=False, pretty=False,
no_new_id=False, encryption=PDF_ENCRYPT_NONE, permissions=-1,
owner_pw=None, user_pw=None)
• Changed in v1.18.7
• Changed in v1.19.0
PDF only: Writes the current content of the document to a bytes object instead of to a file.
Obviously, you should be wary about memory requirements. The meanings of the parameters
exactly equal those in save(). Chapter Collection of Recipes contains an example for using
this method as a pre-processor to pdfrw.
(Changed in version 1.16.0) for extended encryption support.
Return type bytes
Returns a bytes object containing the complete document.
search_page_for(pno, text, quads=False)
Search for “text” on page number “pno”. Works exactly like the corresponding Page.
search_for(). Any integer -∞ < pno < page_count is acceptable.
insert_pdf(docsrc, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=True,
annots=True, show_progress=0, final=1)
• Changed in v1.19.3 - as a fix to issue #537, form fields are always excluded.
PDF only: Copy the page range [from_page, to_page] (including both) of PDF document
docsrc into the current one. Inserts will start with page number start_at. Value -1 indicates
default values. All pages thus copied will be rotated as specified. Links and annotations can
be excluded in the target, see below. All page numbers are 0-based.
Parameters
• docsrc (Document) – An opened PDF Document which must not be the
current document. However, it may refer to the same underlying file.
• from_page (int) – First page number in docsrc. Default is zero.
• to_page (int) – Last page number in docsrc to copy. Defaults to last
page.
• start_at (int) – First copied page, will become page number start_at
in the target. Default -1 appends the page range to the end. If zero, the
page range will be inserted before current first page.
• rotate (int) – All copied pages will be rotated by the provided value
(degrees, integer multiple of 90).
• links (bool) – Choose whether (internal and external) links should be
included in the copy. Default is True. Internal links to outside the copied
page range are always excluded.
• annots (bool) – (new in version 1.16.1) choose whether annotations
should be included in the copy. (Fixed in v1.19.3) Form fields can never
be copied.
• show_progress (int) – (new in version 1.17.7) specify an interval size
greater than zero to see progress messages on sys.stdout. After each interval,
a message like Inserted 30 of 47 pages. will be printed.
• final (int) – (new in v1.18.0) controls whether the list of already copied
objects should be dropped after this method, default True. Set it to 0
except for the last one of multiple insertions from the same source PDF.
This saves target file size and speeds up execution considerably.
Note:
1. If from_page > to_page, pages will be copied in reverse order. If 0 <= from_page ==
to_page, then one page will be copied.
2. docsrc TOC entries will not be copied. It is easy however, to recover a table of contents for the
resulting document. Look at the examples below and at program PDFjoiner.py in the examples
directory: it can join PDF documents and at the same time piece together respective parts of
the tables of contents.
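The page range rules of note 1, together with the defaults of from_page and to_page, can be modeled as follows. The function copied_pages is a hypothetical helper that only computes source page numbers. Clamping out-of-range values to the last page is an assumption of this sketch.

```python
def copied_pages(src_page_count, from_page=-1, to_page=-1):
    """Return the source page numbers in the order they would be copied."""
    last = src_page_count - 1
    if from_page < 0:
        from_page = 0                  # default: first page
    if to_page < 0:
        to_page = last                 # default: last page
    from_page = min(from_page, last)   # assumption: clamp to the last page
    to_page = min(to_page, last)
    if from_page <= to_page:
        return list(range(from_page, to_page + 1))
    # from_page > to_page: pages are copied in reverse order
    return list(range(from_page, to_page - 1, -1))
```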
delete_page(pno=-1)
Parameters pno (int) – the page to be deleted. Negative numbers count backwards
from the end of the document (like with indices). Default is the last
page.
delete_pages(*args, **kwds)
• Changed in v1.18.13: more flexibility specifying pages to delete.
• Changed in v1.18.14: support Python’s del statement.
PDF only: Delete multiple pages given as 0-based numbers.
Format 1: Use keywords. Represents the old format. A contiguous range of pages is removed.
It will also remove any links on remaining pages which point to a deleted one. This action
may have an extended response time for documents with many pages.
The following examples will all delete pages 500 through 519:
• doc.delete_pages(500, 519)
• doc.delete_pages(from_page=500, to_page=519)
• doc.delete_pages((500, 501, 502, ... , 519))
• doc.delete_pages(range(500, 520))
• del doc[500:520]
• del doc[(500, 501, 502, ... , 519)]
• del doc[range(500, 520)]
For the Adobe PDF References the above takes about 0.6 seconds, because the remaining 1290
pages must be cleaned from invalid links.
In general, the performance of this method is dependent on the number of remaining pages –
not on the number of deleted pages: in the above example, deleting all pages except those 20,
will need much less time.
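That the call formats listed above address the same pages can be shown with a small model of the argument handling. The function pages_to_delete is a hypothetical helper that only normalizes its arguments to a list of page numbers; it deletes nothing and is not PyMuPDF's implementation.

```python
def pages_to_delete(*args, from_page=None, to_page=None):
    """Normalize the documented call formats to a sorted list of page numbers."""
    if from_page is not None and to_page is not None:
        return list(range(from_page, to_page + 1))      # keyword format
    if len(args) == 2 and all(isinstance(a, int) for a in args):
        return list(range(args[0], args[1] + 1))        # two integers, inclusive
    if len(args) == 1 and isinstance(args[0], int):
        return [args[0]]                                # a single page
    if len(args) == 1:
        return sorted(set(args[0]))                     # any sequence of numbers
    raise ValueError("unsupported arguments")


expected = list(range(500, 520))
```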
copy_page(pno, to=-1)
PDF only: Copy a page reference within the document.
Parameters
• pno (int) – the page to be copied. Must be in range 0 <= pno < len(doc).
• to (int) – the page number in front of which to copy. The default inserts
after the last page.
Note: Only a new reference to the page object will be created – not a new page object, all
copied pages will have identical attribute values, including the Page.xref. This implies that
any changes to one of these copies will appear on all of them.
fullcopy_page(pno, to=-1)
(New in version 1.14.17)
PDF only: Make a full copy (duplicate) of a page.
Parameters
• pno (int) – the page to be duplicated. Must be in range 0 <= pno <
len(doc).
• to (int) – the page number in front of which to copy. The default inserts
after the last page.
Note:
• In contrast to copy_page(), this method creates a new page object (with a new xref),
which can be changed independently from the original.
• Any Popup and “IRT” (“in response to”) annotations are not copied to avoid potentially
incorrect situations.
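The difference between copy_page() and fullcopy_page() resembles the difference between a second reference and a deep copy in Python. The sketch below is an analogy only: the dictionaries are made-up stand-ins for page objects, and the xref values are hypothetical.

```python
import copy

# Two hypothetical page objects
pages = [{"xref": 10, "rotation": 0}, {"xref": 11, "rotation": 0}]

# Like copy_page(0): a new reference to the same page object
pages.append(pages[0])

# Like fullcopy_page(0): an independent duplicate with its own xref
duplicate = copy.deepcopy(pages[0])
duplicate["xref"] = 12
pages.append(duplicate)

# A change to the original shows up in the reference copy only.
pages[0]["rotation"] = 90
```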
move_page(pno, to=-1)
PDF only: Move (copy and then delete original) a page within the document.
Parameters
• pno (int) – the page to be moved. Must be in range 0 <= pno < len(doc).
• to (int) – the page number in front of which to insert the moved page.
The default moves after the last page.
need_appearances(value=None)
(New in v1.17.4)
PDF only: Get or set the /NeedAppearances property of Form PDFs. Quote: “(Optional) A
flag specifying whether to construct appearance streams and appearance dictionaries for all
widget annotations in the document . . . Default value: false.” This may help controlling the
behavior of some readers / viewers.
Parameters value (bool) – set the property to this value. If omitted or None,
inquire the current value.
Return type bool
Returns
• None: not a Form PDF, or property not defined.
• True / False: the value of the property (either just set or existing for in-
quiries). Has no effect if no Form PDF.
get_sigflags()
PDF only: Return whether the document contains signature fields. This is an optional PDF
property: if not present (return value -1), no conclusions can be drawn – the PDF creator may
just not have bothered to use it.
Return type int
Returns
• -1: not a Form PDF / no signature fields recorded / no SigFlags found.
• 1: at least one signature field exists.
• 3: contains signatures that may be invalidated if the file is saved (written)
in a way that alters its previous contents, as opposed to an incremental
update.
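The return value can be translated into a readable message as sketched below. The function describe_sigflags is a hypothetical helper. Treating the value as bit flags (bit 1 for existing signatures, bit 2 for the incremental-save requirement) follows the SigFlags definition of the PDF specification and covers the documented values 1 and 3.

```python
def describe_sigflags(value):
    """Return a short description for a get_sigflags() result."""
    if value == -1:
        return "unknown: no SigFlags found, or not a Form PDF"
    notes = []
    if value & 1:
        notes.append("signature fields exist")
    if value & 2:
        notes.append("save incrementally to keep signatures valid")
    return "; ".join(notes) or "no flags set"
```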
embfile_add(name, buffer, filename=None, ufilename=None, desc=None)
PDF only: Embed a new file. All string parameters except the name may be unicode (in
previous versions, only ASCII worked correctly). File contents will be compressed (where
beneficial).
Changed in version 1.14.16 The sequence of positional parameters “name” and “buffer” has
been changed to comply with the layout of other functions.
Parameters
• name (str) – entry identifier, must not already exist.
• buffer (bytes,bytearray,BytesIO) – file contents.
(Changed in version 1.14.13) io.BytesIO is now also supported.
• filename (str) – optional filename. Documentation only, will be set to
name if None.
embfile_count()
PDF only: Return the number of embedded files.
Changed in version 1.14.16 This is now a method. In previous versions, this was
a property.
embfile_get(item)
PDF only: Retrieve the content of embedded file by its entry number or name. If the document
is not a PDF, or entry cannot be found, an exception is raised.
Parameters item (int,str) – index or name of entry. An integer must be in
range(embfile_count()).
Return type bytes
embfile_del(item)
PDF only: Remove an entry from /EmbeddedFiles. As always, physical deletion of the embed-
ded file content (and file space regain) will occur only when the document is saved to a new
file with a suitable garbage option.
Changed in version 1.14.16 Items can now be deleted by index, too.
Warning: When specifying an entry name, this function will only delete the first item
with that name. Be aware that PDFs not created with PyMuPDF may contain duplicate
names. So you may want to take appropriate precautions.
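One possible precaution is to look for duplicate names first and then delete by index. The sketch below works on a made-up list of names, which in practice might be collected from the name entry of embfile_info(i) for each i in range(embfile_count()). Deleting from the highest index downwards is an assumption of this sketch, chosen so that the remaining indices would stay valid.

```python
from collections import Counter

# Hypothetical entry names, in index order
names = ["invoice.pdf", "logo.png", "invoice.pdf"]

# Names that occur more than once
duplicates = sorted(name for name, count in Counter(names).items() if count > 1)

# All indices of one duplicated name, highest first
indices = [i for i, name in enumerate(names) if name == "invoice.pdf"]
deletion_order = sorted(indices, reverse=True)
```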
embfile_info(item)
(Changed in v1.18.13)
PDF only: Retrieve information of an embedded file given by its number or by its name.
Parameters item (int/str) – index or name of entry. An integer must be in
range(embfile_count()).
Return type dict
Returns
a dictionary with the following keys:
• name – (str) name under which this entry is stored
• filename – (str) filename
• ufilename – (unicode) filename
• desc – (str) description
>>> d = doc.extract_image(1373)
>>> d
{'ext': 'png', 'smask': 2934, 'width': 5, 'height': 629, 'colorspace': 3,
'xres': 96,
Note: There is a functional overlap with pix = fitz.Pixmap(doc, xref), followed by a pix.tobytes(). Main
differences are that extract_image (1) does not always deliver PNG image formats, (2) is much faster
with non-PNG images, (3) usually results in much less disk storage for extracted images, (4) returns None
in error cases (generates no exception). Look at the following example images within the same PDF.
• xref 1268 is a PNG – Comparable execution time and identical output:
In [30]: len(img["image"])
Out[30]: 371177
extract_font(xref, info_only=False)
PDF Only: Return an embedded font file’s data and appropriate file extension. This can be
used to store the font as an external file. The method does not throw exceptions (other than
via checking for PDF and valid xref).
Parameters
• xref (int) – PDF object number of the font to extract.
• info_only (bool) – only return font information, not the buffer. To be used for
information-only purposes, avoids allocation of large buffer areas.
Return type tuple
Returns a tuple (basename, ext, subtype, buffer), where ext is a 3-byte suggested
file extension (str), basename is the font’s name (str), subtype is the font’s
type (e.g. "Type1") and buffer is a bytes object containing the font file’s
content (or b""). For possible extension values and their meaning see Font
File Extensions. Return details on error:
• ("", "", "", b"") – invalid xref or xref is not a (valid) font object.
• (basename, "n/a", "Type1", b"") – the font basename is not embedded and thus
cannot be extracted. This is the case for e.g. the PDF Base 14 Fonts.
Example:
Warning: The basename is returned unchanged from the PDF. So it may contain char-
acters (such as blanks) which may disqualify it as a filename for your operating system.
Take appropriate action.
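One way to take such action is to reduce the basename to a conservative character set before using it as a file name. This is only a sketch: the helper name font_filename and the replacement rule (anything other than letters, digits, dot, underscore and hyphen becomes an underscore) are assumptions, not library behaviour.

```python
import re


def font_filename(basename, ext):
    """Build a conservative file name from the values returned by extract_font()."""
    if not ext or ext == "n/a":
        # Invalid xref or font not embedded: there is nothing to store.
        return None
    # Replace every run of potentially problematic characters by one underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", basename).strip("_")
    return (safe or "font") + "." + ext
```

With name, ext, _, buffer = doc.extract_font(xref), the buffer could then be written to a file named font_filename(name, ext) whenever that value is not None.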
xref_xml_metadata()
(New in version 1.16.8)
PDF only: Return the xref of the document’s XML metadata.
xref_stream(xref)
(New in version 1.16.8)
PDF only: Return the decompressed contents of the xref stream object.
Parameters xref (int) – xref number.
Return type bytes
Returns the (decompressed) stream of the object.
xref_stream_raw(xref)
(New in version 1.16.8)
PDF only: Return the unmodified (esp. not decompressed) contents of the xref stream object.
Otherwise equal to Document.xref_stream().
Return type bytes
Returns the (original, unmodified) stream of the object.
update_object(xref, obj_str, page=None)
(New in version 1.16.8)
PDF only: Replace object definition of xref with the provided string. The xref may also be new,
in which case this instruction completes the object definition. If a page object is also given, its links
and annotations will be reloaded afterwards.
Parameters
• xref (int) – xref number.
• obj_str (str) – a string containing a valid PDF object definition.
• page (Page) – a page object. If provided, the links and annotations of this
page will be refreshed (reloaded) afterwards to reflect any changes made to
them.
Return type int
Returns zero if successful, otherwise an exception will be raised.
update_stream(xref, data, new=False, compress=True)
• New in v.1.16.8
• Changed in v1.19.2: added parameter “compress”
Replace the stream of an object identified by xref. If the object has no stream, an exception is raised
unless new=True is used. The function automatically performs a compress operation (“deflate”)
where beneficial.
Parameters
• xref (int) – xref number.
• stream (bytes|bytearray|BytesIO) – the new content of the stream.
(Changed in version 1.14.13:) io.BytesIO objects are now also supported.
• new (bool) – whether to force accepting the stream, and thus turning it into
a stream object.
• compress (bool) – whether to compress the inserted stream. If True (default),
the stream will be inserted using /FlateDecode compression, otherwise
the stream will be inserted as is.
Caution: The object of xref must be a PDF dictionary for this to work,
and especially must not be empty – as is the case if you just created the xref
via Document.get_new_xref(). To avoid this, at a minimum execute
doc.update_object(xref, "<<>>") before inserting the stream.
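The sequence described in this caution can be sketched as a small helper. The function name add_stream_object is hypothetical; it uses only Document.get_new_xref(), update_object() and update_stream() as documented here, and passes the stream content positionally.

```python
def add_stream_object(doc, data, compress=True):
    """Create a new PDF object and give it a stream; return the new xref."""
    xref = doc.get_new_xref()        # new xref, its object is still empty
    doc.update_object(xref, "<<>>")  # make it a PDF dictionary first
    # new=True is needed because the object has no stream yet.
    doc.update_stream(xref, data, new=True, compress=compress)
    return xref
```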
This method is primarily intended to manipulate streams containing PDF operator syntax (see pp.
643 of the Adobe PDF References), as is the case for e.g. page content streams.
If you update a contents stream, you should use save parameter clean=True. This ensures consis-
tency between PDF operator source and the object structure.
Example: Let us assume that you no longer want a certain image to appear on a page. This can be
achieved by deleting the respective reference in its contents source(s) – and indeed: the image will
be gone after reloading the page. But the page’s resources object would still show the image as
being referenced by the page. The save option clean=True will clean up any such mismatches.
has_links()
has_annots()
(New in v1.18.7)
PDF only: Check whether there are links, resp. annotations anywhere in the document.
Returns True / False. As opposed to fields, which are also stored in a central place of a
PDF document, the existence of links / annotations can only be detected by parsing
each page. These methods are tuned to do this efficiently and will return immediately
if the answer is True for a page. For PDFs with many thousand pages however,
an answer may take some time (see footnote 6) if no link, resp. no annotation is found.
subset_fonts()
(New in v1.18.7, changed in v1.18.9)
PDF only: Investigate eligible fonts for their use by text in the document. If a font is supported and
a size reduction is possible, that font is replaced by a version with a character subset.
Use this method immediately before saving the document. The following features and restrictions
apply for the time being:
6 For a False result the complete document must be scanned. Neither method loads pages; they only scan object definitions. This makes them
at least 10 times faster than application-level loops (where total response time roughly equals the time for loading all pages). For the Adobe PDF
References (756 pages) and the Pandas documentation (over 3,070 pages), neither of which has annotations, the method needs about 11 ms for the
answer False. So response times will probably become significant only well beyond this order of magnitude.
• Package fontTools must be installed. It is required for creating the font subsets. If not
installed, the method raises an ImportError exception.
• Supported font types only include embedded OTF, TTF and WOFF that are not already sub-
sets.
• The script directory must be available for writing temporary files during the subsetting pro-
cess.
• Changed in v1.18.9: A subset font directly replaces its original – text remains untouched and
is not rewritten. It thus should retain all its properties, like spacing, hiddenness, control by
Optional Content, etc.
The greatest benefit can be achieved when creating new PDFs using large fonts, as is typical for
Asian scripts. In these cases, the set of actually used unicodes is usually small compared to the
number of glyphs in the font. Using this feature can easily reduce the embedded font binary by two
orders of magnitude – from several megabytes to a low two-digit kilobyte amount.
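Because subset_fonts() raises ImportError when fontTools is missing, a caller may prefer to check for the package first. The following sketch assumes a hypothetical helper name, subset_fonts_if_possible; it only tests whether the fontTools package can be found and otherwise skips the call.

```python
import importlib.util


def subset_fonts_if_possible(doc):
    """Call doc.subset_fonts() only if fontTools is importable; return True if called."""
    if importlib.util.find_spec("fontTools") is None:
        return False  # fontTools not installed: leave the fonts untouched
    doc.subset_fonts()
    return True
```

As stated above, this would be called immediately before saving the document.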
journal_enable()
• New in v1.19.0
PDF only: Enable journalling. Use this before you start logging operations.
journal_start_op(name)
• New in v1.19.0
PDF only: Start journalling an “operation” identified by a string “name”. Updates will fail for a
journal-enabled PDF, if no operation has been started.
journal_stop_op()
• New in v1.19.0
PDF only: Stop the current operation. The updates between start and stop of an operation belong
to the same unit of work and will be undone / redone together.
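Since every update of a journal-enabled PDF must happen between journal_start_op() and journal_stop_op(), the pair can be wrapped in a context manager. This is a sketch only; the name journal_op is hypothetical, and journalling is assumed to have been enabled beforehand with journal_enable().

```python
from contextlib import contextmanager


@contextmanager
def journal_op(doc, name):
    """Run the enclosed updates as one named journal operation."""
    doc.journal_start_op(name)
    try:
        yield doc
    finally:
        # Always close the operation, even if an update raised an exception.
        doc.journal_stop_op()
```

All updates made inside a block such as with journal_op(doc, "add page"): would then belong to the same unit of work.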
journal_position()
• New in v1.19.0
PDF only: Return the numbers of the current operation and the total operation count.
Returns a tuple (step, steps) containing the current operation number and the
total number of operations in the journal. If step is 0, we are at the top of the journal.
If step equals steps, we are at the bottom. Updating the PDF with anything other
than undo or redo will automatically remove all journal entries after the current one
and the new update will become the new last entry in the journal. The updates
corresponding to the removed journal entries will be permanently lost.
journal_op_name(step)
• New in v1.19.0
PDF only: Return the name of operation number step.
journal_can_do()
• New in v1.19.0
PDF only: Show whether forward (“redo”) and / or backward (“undo”) executions are possible from
the current journal position.
Returns a dictionary {"undo": bool, "redo": bool}. The respective
method is available if its value is True.
journal_undo()
• New in v1.19.0
PDF only: Revert (undo) the current step in the journal. This moves towards the journal’s top.
journal_redo()
• New in v1.19.0
PDF only: Re-apply (redo) the current step in the journal. This moves towards the journal’s bottom.
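The dictionary returned by journal_can_do() can drive a loop that steps back to the top of the journal. A minimal sketch with a hypothetical helper name, undo_all, using only journal_can_do() and journal_undo():

```python
def undo_all(doc):
    """Undo journal steps until no further undo is possible; return the step count."""
    undone = 0
    while doc.journal_can_do()["undo"]:
        doc.journal_undo()
        undone += 1
    return undone
```

A corresponding redo loop would test the "redo" key and call journal_redo() instead.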
journal_save(filename)
• New in v1.19.0
PDF only: Save the journal to a file.
Parameters filename (str,fp) – either a filename as string or a file object opened
as “wb” (or an io.BytesIO() object).
journal_load(filename)
• New in v1.19.0
PDF only: Load journal from a file. Enables journalling for the document. If journalling is already
enabled, an exception is raised.
Parameters filename (str,fp) – the filename (str) of the journal or a file object
opened as “rb” (or an io.BytesIO() object).
save_snapshot()
• New in v1.19.0
PDF only: Saves a “snapshot” of the document. This is a PDF document with a special,
incremental-save format compatible with journalling – therefore no save options are available. Sav-
ing a snapshot is not possible for new documents.
This is a normal PDF document with no usage restrictions whatsoever. If it is not being changed in
any way, it can be used together with its journal to undo / redo operations or continue updating.
outline
Contains the first Outline entry of the document (or None). Can be used as a starting point to walk
through all outline items. Accessing this property for encrypted, not authenticated documents will
raise an AttributeError.
Type Outline
is_closed
False if document is still open. If closed, most other attributes and methods will have been deleted
/ disabled. In addition, Page objects referring to this document (i.e. created with Document.
load_page()) and their dependent objects will no longer be usable. For reference purposes,
Document.name still exists and will contain the filename of the original document (if applicable).
Type bool
is_dirty
True if this is a PDF document and contains unsaved changes, else False.
Type bool
is_pdf
True if this is a PDF document, else False.
Type bool
is_form_pdf
False if this is not a PDF or has no form fields, otherwise the number of root form fields (fields with
no ancestors).
(Changed in version 1.16.4) Returns the total number of (root) form fields.
Type bool,int
is_reflowable
True if document has a variable page layout (like e-books or HTML). In this case you can set the
desired page dimensions during document creation (open) or via method layout().
Type bool
is_repaired
(New in v1.18.2)
True if PDF has been repaired during open (because of major structure issues). Always False for
non-PDF documents. If true, more details have been stored in TOOLS.mupdf_warnings(),
and Document.can_save_incrementally() will return False.
Type bool
needs_pass
Indicates whether the document is password-protected against access. This indicator remains un-
changed – even after the document has been authenticated. Precludes incremental saves if true.
Type bool
is_encrypted
This indicator initially equals Document.needs_pass. After successful authentication, it is set
to False to reflect the situation.
Type bool
permissions
Contains the permissions to access the document. This is an integer containing bool values in
respective bit positions. For example, if doc.permissions & fitz.PDF_PERM_MODIFY > 0, you
may change the document. See Document Permissions for details.
Changed in version 1.16.0 This is now an integer comprised of bit indicators. Was a dictionary
previously.
Type int
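The bit test can be sketched without the library as follows. The numeric values below are assumptions taken from the permission bits of the PDF specification; real code should use the fitz.PDF_PERM_* constants listed under Document Permissions. The helper name granted is hypothetical.

```python
# Assumed bit values (PDF specification); prefer the fitz.PDF_PERM_* constants.
PDF_PERM_PRINT = 4
PDF_PERM_MODIFY = 8
PDF_PERM_COPY = 16
PDF_PERM_ANNOTATE = 32


def granted(permissions):
    """Return the names of the permissions whose bit is set in the integer."""
    flags = {
        "print": PDF_PERM_PRINT,
        "modify": PDF_PERM_MODIFY,
        "copy": PDF_PERM_COPY,
        "annotate": PDF_PERM_ANNOTATE,
    }
    return [name for name, bit in flags.items() if permissions & bit]
```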
metadata
Contains the document’s meta data as a Python dictionary or None (if is_encrypted=True and
needs_pass=True). Keys are format, encryption, title, author, subject, keywords, creator, producer,
creationDate, modDate, trapped. All item values are strings or None.
Except format and encryption, for PDF documents, the key names correspond in an obvious way
to the PDF keys /Creator, /Producer, /CreationDate, /ModDate, /Title, /Author, /Subject, /Trapped
and /Keywords respectively.
• format contains the document format (e.g. ‘PDF-1.6’, ‘XPS’, ‘EPUB’).
• encryption either contains None (no encryption), or a string naming an encryption method
(e.g. ‘Standard V4 R4 128-bit RC4’). Note that an encryption method may be specified
even if needs_pass=False. In such cases, probably not all permissions will have been granted.
Check Document.permissions for details.
• If the date fields contain valid data (which need not be the case at all!), they are strings in the
PDF-specific timestamp format “D:<TS><TZ>”, where
Type dict
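Date values such as creationDate can be converted to datetime objects with a small parser. This sketch assumes the timestamp layout of the PDF specification (D:YYYYMMDDHHmmSS, optionally followed by Z or an offset such as +01'00'); the helper name parse_pdf_date is hypothetical, and because the fields need not contain valid data it returns None for anything it cannot parse.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: D:YYYYMMDDHHmmSS plus an optional zone (Z, +HH'mm' or -HH'mm').
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Return a datetime for a PDF timestamp string, or None if it is not parseable."""
    match = _PDF_DATE.match(value or "")
    if match is None:
        return None
    parts = match.groups()
    defaults = (0, 1, 1, 0, 0, 0)  # missing month / day become 1, the rest 0
    fields = [int(p) if p is not None else d for p, d in zip(parts[:6], defaults)]
    sign, tz_hours, tz_minutes = parts[6:9]
    try:
        if sign is None:
            tz = None  # no zone given: return a naive datetime
        elif sign in "Zz":
            tz = timezone.utc
        else:
            offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
            tz = timezone(offset if sign == "+" else -offset)
        return datetime(*fields, tzinfo=tz)
    except ValueError:  # e.g. month 13 or an out-of-range zone offset
        return None
```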
name
Contains the filename or filetype value with which Document was created.
Type str
page_count
Contains the number of pages of the document. May return 0 for documents with no pages. Func-
tion len(doc) will also deliver this result.
Type int
chapter_count
(New in version 1.17.0) Contains the number of chapters in the document. Always at least 1.
Relevant only for document types with chapter support (EPUB currently). Other documents will
return 1.
Type int
last_location
(New in version 1.17.0) Contains (chapter, pno) of the document’s last page. Relevant only for
document types with chapter support (EPUB currently). Other documents will return (0, len(doc) -
1), or (0, -1) if they have no pages.
Type tuple
FormFonts
A list of form field font names defined in the /AcroForm object. None if not a PDF.
Type list
Note: For methods that change the structure of a PDF (insert_pdf(), select(), copy_page(),
delete_page() and others), be aware that objects or properties in your program may have been invalidated or
orphaned. Examples are Page objects and their children (links, annotations, widgets), variables holding old page
counts, tables of content and the like. Remember to keep such variables up to date or delete orphaned objects. Also
refer to Ensuring Consistency of Important Objects in PyMuPDF.
Clear metadata information. If you do this out of privacy / data protection concerns, make sure you save the document
as a new file with garbage > 0. Only then the old /Info object will also be physically removed from the file. In this
case, you may also want to clear any XML metadata inserted by several PDF editors:
>>> import fitz
>>> doc=fitz.open("pymupdf.pdf")
>>> doc.metadata # look at what we currently have
{'producer': 'rst2pdf, reportlab', 'format': 'PDF 1.4', 'encryption': None, 'author':
This shows how to modify or add a table of contents. Also have a look at csv2toc.py and toc2csv.py in the examples
directory.
Obviously, similar ways can be found in more general situations. Just make sure that hierarchy levels in a row do not
increase by more than one. Inserting dummy bookmarks before and after toc2 segments would heal such cases. A
ready-to-use GUI (wxPython) solution can be found in script PDFjoiner.py of the examples directory.
(2) More examples:
>>> # insert 5 pages of doc2, where its page 21 becomes page 15 in doc1
>>> doc1.insert_pdf(doc2, from_page=21, to_page=25, start_at=15)
>>> # same example, but pages are rotated and copied in reverse order
>>> doc1.insert_pdf(doc2, from_page=25, to_page=21, start_at=15, rotate=90)
for i in range(len(doc)):
    imglist = doc.get_page_images(i)
    for img in imglist:
        xref = img[0]  # xref number
        pix = fitz.Pixmap(doc, xref)  # make pixmap from image
        if pix.n - pix.alpha < 4:  # can be saved as PNG
            pix.save("p%s-%s.png" % (i, xref))
        else:  # CMYK: must convert first
            pix0 = fitz.Pixmap(fitz.csRGB, pix)
            pix0.save("p%s-%s.png" % (i, xref))
            pix0 = None  # free Pixmap resources
        pix = None  # free Pixmap resources
6.5 Font
(New in v1.16.18) This class represents a font as defined in MuPDF (fz_font_s structure). It is required for the new class
TextWriter and the new Page.write_text(). Currently, it has no connection to how fonts are used in methods
Page.insert_text() or Page.insert_textbox(), respectively.
A Font object also contains useful general information, like the font bbox, the number of defined glyphs, glyph names
or the bbox of a single glyph.
Class API
class Font
Argument      Action
fontfile?     Create font from file, exception if failure.
fontbuffer?   Create font from buffer, exception if failure.
ordering>=0   Create universal font, always succeeds.
fontname?     Create a Base-14 font, universal font, or font provided by pymupdf-fonts. See table below.
Note: With the usual reserved names “helv”, “tiro”, etc., you will create fonts with the expected names
“Helvetica”, “Times-Roman” and so on. However, and in contrast to Page.insert_font() and
friends,
• a font file will always be embedded in your PDF,
• Greek and Cyrillic characters are supported without needing the encoding parameter.
Using ordering >= 0, or fontnames “cjk”, “china-t”, “china-s”, “japan” or “korea” will always create
the same “universal” font “Droid Sans Fallback Regular”. This font supports all Chinese, Japanese,
Korean and Latin characters, including Greek and Cyrillic. This is a sans-serif font.
In fact, you would rarely need a sans-serif font other than “Droid Sans Fallback Regular”, except
that its font file is relatively large and adds about 1.65 MB (compressed) to your PDF file size. If
you do not need CJK support, stick with specifying “helv”, “tiro” etc., and you will get away with about
35 KB compressed.
If you know you have a mixture of CJK and Latin text, consider just using Font("cjk") because this
supports everything and also significantly (by a factor of up to three) speeds up execution: MuPDF will
always find any character in this single font and never needs to check fallbacks.
But if you do use some other font, you will still automatically be able to also write CJK characters:
MuPDF detects this situation and silently falls back to the universal font (which will then of course also
be embedded in your PDF).
(New in v1.17.5) Optionally, some new “reserved” fontname codes become available if you install
pymupdf-fonts (pip install pymupdf-fonts). “Fira Mono” is a mono-spaced sans font set and
FiraGO is another non-serifed “universal” font set which supports all Latin (including Cyrillic and Greek)
plus Thai, Arabic, Hebrew and Devanagari – but none of the CJK languages. The size of a FiraGO font
is only a quarter of the “Droid Sans Fallback” size (compressed 400 KB vs. 1.65 MB) – and it provides
the weights bold, italic, bold-italic – which the universal font doesn’t.
“Space Mono” is another nice and small mono-spaced font from Google Fonts, which supports Latin
Extended characters and comes with all 4 important weights.
The following table maps a fontname code to the corresponding font:
2 The built-in module array has been chosen for its speed and its compact representation of values.
Note: This method only returns meaningful data for fonts having a CMAP (character map, charmap, the
/ToUnicode PDF key). Otherwise, this array will have length 1 and contain zero only.
text_length(text, fontsize=11)
Calculate the length in points of a unicode string.
Note: There is a functional overlap with get_text_length() for Base-14 fonts only.
Parameters
• text (str) – a text string, UTF-8 encoded.
• fontsize (float) – the fontsize.
Return type float
Returns
the length of the string in points when stored in the PDF. If a character is not contained
in the font, it will automatically be looked up in a fallback font.
Note: This method was originally implemented in Python, based on calling Font.
glyph_advance(). For performance reasons, it has been rewritten in C for
v1.18.14. To compute the width of a single character, you can now use either of the
following without performance penalty:
1. font.glyph_advance(ord("Ä")) * fontsize
2. font.text_length("Ä", fontsize=fontsize)
For multi-character strings, the method offers a huge performance advantage compared
to the previous implementation: instead of about 0.5 microseconds for each character,
only 12.5 nanoseconds are required for the second and subsequent ones.
char_lengths(text, fontsize=11)
New in v1.18.14
Sequence of character lengths in points of a unicode string.
Parameters
• text (str) – a text string, UTF-8 encoded.
• fontsize (float) – the fontsize.
Return type tuple
Returns
the lengths in points of the characters of a string when stored in the PDF. It works
like Font.text_length() broken down to single characters. This is a high
speed method, used e.g. in TextWriter.fill_textbox(). The following is
true (allowing rounding errors): font.text_length(text) == sum(font.
char_lengths(text)).
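As an illustration of how per-character lengths can be used, the sketch below finds the longest prefix of a string that fits into a given width. The helper name chars_that_fit is hypothetical and is not a description of how TextWriter.fill_textbox() works internally; font may be any object offering char_lengths(text, fontsize=...) as described above.

```python
def chars_that_fit(font, text, width, fontsize=11):
    """Return the longest prefix of text whose total length does not exceed width."""
    total = 0.0
    for i, length in enumerate(font.char_lengths(text, fontsize=fontsize)):
        total += length
        if total > width:
            return text[:i]  # character i no longer fits
    return text
```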
buffer
(New in v1.17.6)
Copy of the binary font file content.
Return type bytes
flags
A dictionary with various font properties, each represented as bools. Example for Helvetica:
>>> pprint(font.flags)
{'bold': 0,
'fake-bold': 0,
'fake-italic': 0,
'invalid-bbox': 0,
'italic': 0,
'mono': 0,
'opentype': 0,
'serif': 1,
'stretch': 0,
'substitute': 0}
name
Return type str
6.6 Identity
Identity is a Matrix that performs no action – to be used whenever the syntax requires a matrix, but no actual transfor-
mation should take place. It has the form fitz.Matrix(1, 0, 0, 1, 0, 0).
Identity is a constant, an “immutable” object. So, all of its matrix properties are read-only and its methods are disabled.
If you need a mutable identity matrix as a starting point, use one of the following statements: fitz.Matrix(1, 0, 0, 1, 0, 0), fitz.Matrix(1, 1), fitz.Matrix(0.0) or fitz.Matrix(fitz.Identity).
6.7 IRect
IRect is a rectangular bounding box, very similar to Rect, except that all corner coordinates are integers. IRect is used
to specify an area of pixels, e.g. to receive image data during rendering. Otherwise, the considerations concerning
emptiness and validity of rectangles also apply to this class. Methods and attributes have the same names, and in many
cases are implemented by re-using the respective Rect counterparts.
Class API
class IRect
__init__(self)
__init__(self, x0, y0, x1, y1)
__init__(self, irect)
__init__(self, sequence)
Overloaded constructors. Also see examples below and those for the Rect class.
If another irect is specified, a new copy will be made.
If sequence is specified, it must be a Python sequence type of 4 numbers (see Using Python Sequences
as Arguments in PyMuPDF). Non-integer numbers will be truncated, non-numeric values will raise an
exception.
The other parameters mean integer coordinates.
get_area([unit])
Calculates the area of the rectangle and, with no parameter, equals abs(IRect). Like an empty rectangle,
the area of an infinite rectangle is also zero.
Parameters unit (str) – Specify required unit: respective squares of “px” (pixels, de-
fault), “in” (inches), “cm” (centimeters), or “mm” (millimeters).
Return type float
intersect(ir)
The intersection (common rectangular area) of the current rectangle and ir is calculated and replaces the
current rectangle. If either rectangle is empty, the result is also empty. If either rectangle is infinite, the
other one is taken as the result – and hence also infinite if both rectangles were infinite.
Parameters ir (rect_like) – Second rectangle.
contains(x)
Checks whether x is contained in the rectangle. It may be rect_like, point_like or a number. If
x is an empty rectangle, this is always true. Conversely, if the rectangle itself is empty, this is always False
unless x is an empty rectangle or a number. If x is a number, it will be checked to be one of the four
components. x in irect and irect.contains(x) are equivalent.
Parameters x (IRect or Rect or Point or int) – the object to check.
Return type bool
intersects(r)
Checks whether the rectangle and the rect_like “r” contain a common non-empty IRect. This will
always be False if either is infinite or empty.
Parameters r (rect_like) – the rectangle to check.
Return type bool
torect(rect)
(New in version 1.19.3)
Compute the matrix which transforms this rectangle to a given one. See Rect.torect().
Parameters rect (rect_like) – the target rectangle. Must not be empty or infinite.
Return type Matrix
Returns a matrix mat such that self * mat = rect. Can for example be used to
transform between the page and the pixmap coordinates.
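The relation self * mat = rect can be illustrated in plain Python. This is a sketch of one way to construct such a matrix from a scale and a shift, not necessarily how the library computes it; the names torect_matrix and apply_matrix are hypothetical, and rectangles are given as (x0, y0, x1, y1) tuples.

```python
def torect_matrix(src, dst):
    """Return (a, b, c, d, e, f) mapping rectangle src onto rectangle dst."""
    sx0, sy0, sx1, sy1 = src
    dx0, dy0, dx1, dy1 = dst
    if sx1 == sx0 or sy1 == sy0:
        raise ValueError("source rectangle must not be empty")
    a = (dx1 - dx0) / (sx1 - sx0)  # horizontal scale
    d = (dy1 - dy0) / (sy1 - sy0)  # vertical scale
    e = dx0 - sx0 * a              # horizontal shift
    f = dy0 - sy0 * d              # vertical shift
    return (a, 0.0, 0.0, d, e, f)


def apply_matrix(point, matrix):
    """Transform the point (x, y) by the six matrix values."""
    x, y = point
    a, b, c, d, e, f = matrix
    return (x * a + y * c + e, x * b + y * d + f)
```

Applying the result to the top-left and bottom-right corners of src yields the corresponding corners of dst.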
morph(fixpoint, matrix)
(New in version 1.17.0)
Return a new quad after applying a matrix to it using a fixed point.
Parameters
• fixpoint (point_like) – the fixed point.
• matrix (matrix_like) – the matrix.
Returns a new Quad. This is a wrapper of the same-named quad method. If infinite, the
infinite quad is returned.
norm()
(New in version 1.16.0)
Return the Euclidean norm of the rectangle treated as a vector of four numbers.
normalize()
Make the rectangle finite. This is done by shuffling rectangle corners. After this, the bottom right corner
will indeed be south-east of the top left one. See Rect for more details.
top_left
tl
Equals Point(x0, y0).
Type Point
top_right
tr
Equals Point(x1, y0).
Type Point
bottom_left
bl
Equals Point(x0, y1).
Type Point
bottom_right
br
Equals Point(x1, y1).
Type Point
rect
The Rect with the same coordinates as floats.
Type Rect
quad
The quadrilateral Quad(irect.tl, irect.tr, irect.bl, irect.br).
Type Quad
width
Contains the width of the bounding box. Equals abs(x1 - x0).
Type int
height
Contains the height of the bounding box. Equals abs(y1 - y0).
Type int
x0
X-coordinate of the left corners.
Type int
y0
Y-coordinate of the top corners.
Type int
x1
X-coordinate of the right corners.
Type int
y1
Y-coordinate of the bottom corners.
Type int
is_infinite
True if rectangle is infinite, False otherwise.
Type bool
is_empty
True if rectangle is empty, False otherwise.
Type bool
Note:
• This class adheres to the Python sequence protocol, so components can be accessed via their index, too. Also
refer to Using Python Sequences as Arguments in PyMuPDF.
• Rectangles can be used with arithmetic operators – see chapter Operator Algebra for Geometry Objects.
6.8 Link
Represents a pointer to somewhere (this document, other documents, the internet). Links exist per document page, and
they are forward-chained to each other, starting from an initial link which is accessible by the Page.first_link
property.
There is a parent-child relationship between a link and its page. If the page object becomes unusable (closed document,
any document structure change, etc.), then so does each of its existing link objects: an exception saying that
the object is “orphaned” is raised whenever a link property or method is accessed.
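The forward chain of links can be walked with a small generator. A minimal sketch, using only the Page.first_link and Link.next properties described in this chapter; the function name iter_links is hypothetical.

```python
def iter_links(page):
    """Yield the links of a page by following the chain from first_link via next."""
    link = page.first_link
    while link is not None:
        yield link
        link = link.next
```

The yielded objects should only be used while the page is still valid, for the reasons given above.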
Class API
class Link
Note: In PDF, links are a subtype of annotations technically and do not support fill colors. However, to
keep a consistent API, we do allow specifying a fill= parameter like with all annotations, which will
be ignored with a warning.
(Changed in version 1.16.9) Allow colors to be directly set. These parameters are used if colors is not a
dictionary.
Parameters
• colors (dict) – a dictionary containing color specifications. For accepted dic-
tionary keys and values see below. The most practical way should be to first make
a copy of the colors property and then modify this dictionary as required.
• stroke (sequence) – see above.
set_flags(flags)
New in v1.18.16
Set the PDF /F property of the link annotation. See Annot.set_flags() for details. If not a PDF,
this method is a no-op.
flags
New in v1.18.16
Return the link annotation flags, an integer (see Annot.flags for details). Zero if not a PDF.
colors
Meaningful for PDF only: A dictionary of two tuples of floats in range 0 <= float <= 1 specifying
the stroke and the interior (fill) colors. If not a PDF, None is returned. As mentioned above, the fill color
is always None for links. The stroke color is used for the border of the link rectangle. The length of the
tuple implicitly determines the colorspace: 1 = GRAY, 3 = RGB, 4 = CMYK. So (1.0, 0.0, 0.0)
stands for RGB color red. The value of each float f is mapped to the integer value i in range 0 to 255 via
the computation f = i / 255.
Return type dict
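The mapping f = i / 255 and the length rule for the colorspace can be written out as follows. Both helper names, to_pdf_color and colorspace_name, are hypothetical.

```python
def to_pdf_color(*channels):
    """Convert 1, 3 or 4 integer channel values (0 to 255) to a tuple of floats."""
    if len(channels) not in (1, 3, 4):
        raise ValueError("expected 1 (GRAY), 3 (RGB) or 4 (CMYK) values")
    return tuple(i / 255 for i in channels)


def colorspace_name(color):
    """Return the colorspace implied by the length of a color tuple."""
    return {1: "GRAY", 3: "RGB", 4: "CMYK"}[len(color)]
```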
border
Meaningful for PDF only: A dictionary containing border characteristics. It will be None for non-PDFs
and an empty dictionary if no border information exists. The following keys can occur:
• width – a float indicating the border thickness in points. The value is -1.0 if no width is specified.
• dashes – a sequence of integers specifying a line dash pattern. [] means no dashes, [n] means equal
on-off lengths of n points, longer lists will be interpreted as specifying alternating on-off length
values. See the Adobe PDF References page 126 for more detail.
• style – 1-byte border style: S (Solid) = solid rectangle surrounding the annotation, D (Dashed) =
dashed rectangle surrounding the link, the dash pattern is specified by the dashes entry, B (Beveled)
= a simulated embossed rectangle that appears to be raised above the surface of the page, I (Inset)
= a simulated engraved rectangle that appears to be recessed below the surface of the page, U
(Underline) = a single line along the bottom of the annotation rectangle.
rect
The area that can be clicked in untransformed coordinates.
Type Rect
isExternal
A bool specifying whether the link target is outside of the current document.
Type bool
uri
A string specifying the link target. The meaning of this property should be evaluated in conjunction with
property isExternal. The value may be None, in which case isExternal == False. If uri starts with file://,
mailto:, or an internet resource name, isExternal is True. In all other cases isExternal == False and uri
points to an internal location. In case of PDF documents, this should either be #nnnn to indicate a 1-based
(!) page number nnnn, or a named location. The format varies for other document types, e.g. uri =
‘../FixedDoc.fdoc#PG_2_LNK_1’ for page number 2 (1-based) in an XPS document.
Type str
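For internal PDF links of the form #nnnn, the 1-based page number can be turned into a 0-based page index as sketched below. The helper name internal_page_index is hypothetical; named locations and other URI formats are not interpreted and yield None.

```python
def internal_page_index(uri):
    """Return the 0-based page index for a uri like '#12', or None otherwise."""
    if uri and uri.startswith("#") and uri[1:].isdigit():
        return int(uri[1:]) - 1  # nnnn is 1-based
    return None
```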
xref
An integer specifying the PDF xref. Zero if not a PDF.
Type int
next
The next link or None.
Type Link
dest
The link destination details object.
Type linkDest
6.9 linkDest
Class representing the dest property of an outline entry or a link. Describes the destination to which such entries point.
Note: Up to MuPDF v1.9.0 this class existed inside MuPDF and was dropped in version 1.10.0. For backward
compatibility, PyMuPDF is still maintaining it, although some of its attributes are no longer backed by data actually
available via MuPDF.
Class API
class linkDest
dest
Target destination name if linkDest.kind is LINK_GOTOR and linkDest.page is -1.
Type str
fileSpec
Contains the filename and path this link points to, if linkDest.kind is LINK_GOTOR or
LINK_LAUNCH.
Type str
flags
A bitfield describing the validity and meaning of the different aspects of the destination. As far as possible,
link destinations are constructed such that e.g. linkDest.lt and linkDest.rb can be treated as
defining a bounding box. But the flags indicate which of the values were actually specified, see Link
Destination Flags.
Type int
isMap
This flag specifies whether to track the mouse position when the URI is resolved. Default value: False.
Type bool
isUri
Specifies whether this destination is an internet resource (as opposed to e.g. a local file specification in
URI format).
Type bool
kind
Indicates the type of this destination, like a place in this document, a URI, a file launch, an action or a
place in another file. Look at Link Destination Kinds to see the names and numerical values.
Type int
lt
The top left Point of the destination.
Type Point
named
This destination refers to some named action to perform (e.g. a javascript, see Adobe PDF References).
Standard actions provided are NextPage, PrevPage, FirstPage, and LastPage.
Type str
newWindow
If true, the destination should be launched in a new window.
Type bool
page
The page number (in this or the target document) this destination points to. Only set if linkDest.
kind is LINK_GOTOR or LINK_GOTO. May be -1 if linkDest.kind is LINK_GOTOR. In this case
linkDest.dest contains the name of a destination in the target document.
Type int
rb
The bottom right Point of this destination.
Type Point
uri
The name of the URI this destination points to.
Type str
6.10 Matrix
Matrix is a row-major 3x3 matrix used by image transformations in MuPDF (which complies with the respective
concepts laid down in the Adobe PDF References). With matrices you can manipulate the rendered image of a page
in a variety of ways: (parts of) the page can be rotated, zoomed, flipped, sheared and shifted by setting some or all of
just six float values.
Since all points or pixels live in a two-dimensional space, one column vector of that matrix is a constant unit vector,
and only the remaining six elements are used for manipulations. These six elements are usually represented by [a, b,
c, d, e, f]. Here is how they are positioned in the matrix:
| a  b  0 |
| c  d  0 |
| e  f  1 |
Please note:
• the below methods are just convenience functions – everything they do, can also be achieved by directly manip-
ulating the six numerical values
• all manipulations can be combined – you can construct a matrix that rotates and shears and scales and shifts,
etc. in one go. If you however choose to do this, do have a look at the remarks further down or at the Adobe
PDF References.
Class API
class Matrix
__init__(self )
__init__(self, zoom-x, zoom-y)
__init__(self, shear-x, shear-y, 1)
__init__(self, a, b, c, d, e, f )
__init__(self, matrix)
__init__(self, degree)
__init__(self, sequence)
Overloaded constructors.
Without parameters, the zero matrix Matrix(0.0, 0.0, 0.0, 0.0, 0.0, 0.0) will be created.
zoom-* and shear-* specify zoom or shear values (float) and create a zoom or shear matrix, respectively.
For “matrix” a new copy of another matrix will be made.
Float value “degree” specifies the creation of a rotation matrix which rotates anti-clockwise.
A “sequence” must be any Python sequence object with exactly 6 float entries (see Using Python Se-
quences as Arguments in PyMuPDF).
fitz.Matrix(1, 1), fitz.Matrix(0.0) and fitz.Matrix(fitz.Identity) create modifiable versions of the Identity
matrix, which looks like [1, 0, 0, 1, 0, 0].
norm()
(New in version 1.16.0)
Return the Euclidean norm of the matrix as a vector.
prerotate(deg)
Modify the matrix to perform a counter-clockwise rotation for positive deg degrees, else clockwise. The
matrix elements of an identity matrix will change in the following way:
[1, 0, 0, 1, 0, 0] -> [cos(deg), sin(deg), -sin(deg), cos(deg), 0, 0].
Parameters deg (float) – The rotation angle in degrees (use conventional notation based
on Pi = 180 degrees).
prescale(sx, sy)
Modify the matrix to scale by the zoom factors sx and sy. Has effects on attributes a thru d only: [a, b, c,
d, e, f] -> [a*sx, b*sx, c*sy, d*sy, e, f].
Parameters
• sx (float) – Zoom factor in X direction. For the effect see description of attribute
a.
• sy (float) – Zoom factor in Y direction. For the effect see description of attribute
d.
preshear(sx, sy)
Modify the matrix to perform a shearing, i.e. transformation of rectangles into parallelograms (rhomboids).
Has effects on attributes a thru d only: [a, b, c, d, e, f] -> [a + c*sy, b + d*sy, c + a*sx, d + b*sx, e, f].
Parameters
• sx (float) – Shearing effect in X direction. See attribute c.
f
Causes a vertical shift effect: Each Point(x, y) will become Point(x, y - f). Positive (negative) values of f
will shift down (up).
Type float
is_rectilinear
Rectilinear means that no shearing is present and that any rotations are integer multiples of 90 degrees.
Usually this is used to confirm that (axis-aligned) rectangles before the transformation are still axis-
aligned rectangles afterwards.
Type bool
Note:
• This class adheres to the Python sequence protocol, so components can be accessed via their index, too. Also
refer to Using Python Sequences as Arguments in PyMuPDF.
• A matrix can be used with arithmetic operators – see chapter Operator Algebra for Geometry Objects.
• Changes of matrix properties and execution of matrix methods can be executed consecutively. This is the same
as multiplying the respective matrices.
• Matrix multiplication is not commutative – changing the execution sequence in general changes the result. So
it can quickly become unclear which result a transformation will yield.
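The last two points can be checked with plain arithmetic. The following sketch is not PyMuPDF code: it models a matrix as the six-element list [a, b, c, d, e, f] described above, and `concat` is a hypothetical helper that multiplies two such matrices. It assumes the row-vector convention (a point is multiplied from the left), so `concat(m1, m2)` means "apply m1 first, then m2".

```python
def concat(m1, m2):
    """Return the product m1 * m2 of two matrices given as [a, b, c, d, e, f]."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


shift = [1, 0, 0, 1, 100, 0]  # move right by 100
zoom = [2, 0, 0, 2, 0, 0]     # enlarge by factor 2

# Same two operations, different order, different result.
print(concat(shift, zoom))  # the shift gets zoomed as well
print(concat(zoom, shift))  # the shift stays at 100
```

Under the same convention, `concat([sx, 0, 0, sy, 0, 0], m)` reproduces the element changes listed for prescale(sx, sy).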
6.10.1 Examples
Here are examples to illustrate some of the effects achievable. The following pictures start with a page of the PDF
version of this help file. We show what happens when a matrix is being applied (though always full pages are created,
only parts are displayed here to save space).
[Figure: the original page image]
6.10.2 Shifting
6.10.3 Flipping
6.10.4 Shearing
6.10.5 Rotating
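As a numeric stand-in for the four effects named in the headings above, the sketch below maps one point with four matrices. It is plain Python, not PyMuPDF code: `transform` and `rotation` are hypothetical helpers, the matrix values are chosen for illustration only, and the rotation values follow the formula given for prerotate() on an identity matrix.

```python
import math


def transform(point, m):
    """Map point (x, y) with a matrix given as [a, b, c, d, e, f]."""
    x, y = point
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def rotation(deg):
    """Six values of a rotation matrix: [cos, sin, -sin, cos, 0, 0]."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad), -math.sin(rad), math.cos(rad), 0, 0]


effects = {
    "shift right by 100": [1, 0, 0, 1, 100, 0],
    "flip left-right": [-1, 0, 0, 1, 0, 0],
    "shear in x direction": [1, 0, 0.5, 1, 0, 0],
    "rotate by 90 degrees": rotation(90),
}

for name, m in effects.items():
    print(name, transform((10, 20), m))
```

With PyMuPDF available, the same six values could be passed to the six-argument constructor `fitz.Matrix(a, b, c, d, e, f)` listed in the Class API.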
6.11 Outline
outline (or “bookmark”) is a property of Document. If not None, it stands for the first outline item of the document.
Its properties in turn define the characteristics of this item and also point to other outline items in “horizontal” or
downward direction. The full tree of all outline items for e.g. a conventional table of contents (TOC) can be recovered
by following these “pointers”.
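Following these pointers could look like the sketch below. `walk_outline` is a hypothetical helper, not part of PyMuPDF; it relies only on the attributes down, next, title and page described in the Class API.

```python
def walk_outline(item, level=1):
    """Yield (level, title, page) for an outline item, its siblings and children."""
    while item is not None:
        yield level, item.title, item.page
        if item.down is not None:
            # Descend into the children before moving on to the next sibling.
            yield from walk_outline(item.down, level + 1)
        item = item.next
```

A call such as `list(walk_outline(doc.outline))`, with `doc` being an open Document, would then give a flat, TOC-like list.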
Class API
class Outline
down
The next outline item on the next level down. Is None if the item has no kids.
Type Outline
next
The next outline item at the same level as this item. Is None if this is the last one in its level.
Type Outline
page
The page number (0-based) this bookmark points to.
Type int
title
The item’s title as a string or None.
Type str
is_open
Indicator showing whether any sub-outlines should be expanded (True) or be collapsed (False). This
information is interpreted by PDF reader software.
Type bool
is_external
A bool specifying whether the target is outside (True) of the current document.
Type bool
uri
A string specifying the link target. The meaning of this property should be evaluated in conjunction with
is_external. The value may be None, in which case is_external == False. If uri starts with file://, mailto:,
or an internet resource name, is_external is True. In all other cases is_external == False and uri points
to an internal location. In case of PDF documents, this should either be #nnnn to indicate a 1-based
(!) page number nnnn, or a named location. The format varies for other document types, e.g. uri =
‘../FixedDoc.fdoc#PG_21_LNK_84’ for page number 21 (1-based) in an XPS document.
Type str
dest
The link destination details object.
Type linkDest
6.12 Page
Class representing a document page. A page object is created by Document.load_page() or, equivalently, via
indexing the document like doc[n] - it has no independent constructor.
There is a parent-child relationship between a document and its pages. If the document is closed or deleted, all page
objects (and their respective children, too) in existence will become unusable (“orphaned”): If a page property or
method is being used, an exception is raised.
Several page methods have a Document counterpart for convenience. At the end of this chapter you will find a synopsis.
Changing page properties and adding or changing page content is available for PDF documents only.
In a nutshell, this is what you can do with PyMuPDF:
• Modify page rotation and the visible part (“cropbox”) of the page.
• Insert images, other PDF pages, text and simple geometrical objects.
• Add annotations and form fields.
Note: Methods require coordinates (points, rectangles) to put content in desired places. Please be aware that since
v1.17.0 these coordinates must always be provided relative to the unrotated page. The reverse is also true: except for
Page.rect and Page.bound() (both of which reflect page rotation), all coordinates returned by methods and
attributes pertain to the unrotated page.
So the returned value of e.g. Page.get_image_bbox() will not change if you do a Page.set_rotation().
The same is true for coordinates returned by Page.get_text(), annotation rectangles, and so on. If you
want to find out, where an object is located in rotated coordinates, multiply the coordinates with Page.
rotation_matrix. There also is its inverse, Page.derotation_matrix, which you can use when inter-
facing with other readers, which may behave differently in this respect.
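The conversion in both directions could be wrapped as follows. The two function names are hypothetical; the sketch assumes that `rect` is a PyMuPDF geometry object (for example a Rect) that supports multiplication with a matrix.

```python
def to_rotated(page, rect):
    """Express an unrotated-page rectangle in rotated page coordinates."""
    return rect * page.rotation_matrix


def to_unrotated(page, rect):
    """Inverse direction: from rotated coordinates back to the unrotated page."""
    return rect * page.derotation_matrix
```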
Note: If you add or update annotations, links or form fields on the page and immediately afterwards need to work
with them (i.e. without leaving the page), you should reload the page using Document.reload_page() before
referring to these new or updated items.
This ensures all your changes have been fully applied to PDF structures, so you can safely create Pixmaps or successfully
iterate over annotations, links and form fields.
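A minimal sketch of this pattern, assuming `doc` is an open PDF Document and `page` one of its pages. The function name is hypothetical, and the caret annotation merely serves as an example of a change.

```python
def add_caret_then_list(doc, page, point):
    """Add a caret annotation, reload the page, then list the annotation types."""
    page.add_caret_annot(point)
    # Refresh the page object before working with the new annotation.
    page = doc.reload_page(page)
    return page, [annot.type for annot in page.annots()]
```

The reloaded page object is returned as well, so the caller can continue with it instead of the old reference.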
Class API
class Page
bound()
Determine the rectangle of the page. Same as property Page.rect below. For PDF documents this
usually also coincides with mediabox and cropbox, but not always. For example, if the page is
rotated, then this is reflected by this method – the Page.cropbox however will not change.
Return type Rect
add_caret_annot(point)
(New in version 1.16.0)
PDF only: Add a caret icon. A caret annotation is a visual symbol normally used to indicate the presence
of text edits on the page.
Parameters point (point_like) – the top left point of a 20 x 20 rectangle containing
the MuPDF-provided icon.
Return type Annot
Returns the created annotation. Stroke color blue = (0, 0, 1), no fill color support.
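add_text_annot(point, text, icon="Note")
PDF only: Add a comment icon (“sticky note”) with accompanying text. Only the icon is visible, the
accompanying text is hidden and can be visualized by many PDF viewers by hovering the mouse over the
symbol.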
Parameters
• point (point_like) – the top left point of a 20 x 20 rectangle containing the
MuPDF-provided “note” icon.
• text (str) – the commentary text. This will be shown on double clicking or
hovering over the icon. May contain any Latin characters.
• icon (str) – (new in version 1.16.0) choose one of “Note” (default), “Comment”,
“Help”, “Insert”, “Key”, “NewParagraph”, “Paragraph” as the visual symbol for
the embodied text [4].
Return type Annot
Returns the created annotation. Stroke color yellow = (1, 1, 0), no fill color support.
add_freetext_annot(rect, text, fontsize=12, fontname="helv", text_color=0, fill_color=1, ro-
tate=0, align=TEXT_ALIGN_LEFT)
PDF only: Add text in a given rectangle.
Parameters
• rect (rect_like) – the rectangle into which the text should be inserted. Text
is automatically wrapped to a new line at box width. Lines not fitting into the box
will be invisible.
• text (str) – the text. (New in v1.17.0) May contain any mixture of Latin, Greek,
Cyrillic, Chinese, Japanese and Korean characters. The respective required font is
automatically determined.
• fontsize (float) – the font size. Default is 12.
• fontname (str) – the font name. Default is “Helv”. Accepted alternatives are
“Cour”, “TiRo”, “ZaDb” and “Symb”. The name may be abbreviated to the first
two characters, like “Co” for “Cour”. Lower case is also accepted. (Changed
in v1.16.0) Bold or italic variants of the fonts are no longer accepted. A user-
contributed script provides a circumvention for this restriction – see section Using
Buttons and JavaScript in chapter Collection of Recipes. (New in v1.17.0) The
actual font to use is now determined on a by-character level, and all required fonts
(or sub-fonts) are automatically included. Therefore, you should rarely ever need
to care about this parameter and can let it default (unless you insist on a serifed font
for your non-CJK text parts).
• text_color (sequence,float) – (new in version 1.16.0) the text color. De-
fault is black.
• fill_color (sequence,float) – (new in version 1.16.0) the fill color. De-
fault is white.
• align (int) – (new in version 1.17.0) text alignment, one of
TEXT_ALIGN_LEFT, TEXT_ALIGN_CENTER, TEXT_ALIGN_RIGHT -
justify is not supported.
• rotate (int) – the text orientation. Accepted values are 0, 90 and 270; invalid
entries are set to zero.
Return type Annot
Returns the created annotation. Color properties can only be changed using special parameters
of Annot.update(). There, you can also set a border color different from the
text color.
[4] You are generally free to choose any of the Annotation Icons in MuPDF you consider adequate.
add_file_annot(pos, buffer, filename, ufilename=None, desc=None, icon="PushPin")
PDF only: Add a file attachment annotation with a “PushPin” icon at the specified location.
Parameters
• pos (point_like) – the top-left point of an 18x18 rectangle containing the
MuPDF-provided “PushPin” icon.
• buffer (bytes,bytearray,BytesIO) – the data to be stored (actual file
content, any data, etc.).
Changed in version 1.14.13: io.BytesIO is now also supported.
• filename (str) – the filename to associate with the data.
• ufilename (str) – the optional PDF unicode version of filename. Defaults to
filename.
• desc (str) – an optional description of the file. Defaults to filename.
• icon (str) – (new in version 1.16.0) choose one of “PushPin” (default), “Graph”,
“Paperclip”, “Tag” as the visual symbol for the attached data [4].
Return type Annot
Returns the created annotation. Stroke color yellow = (1, 1, 0), no fill color support.
add_ink_annot(list)
PDF only: Add a “freehand” scribble annotation.
Parameters list (sequence) – a list of one or more lists, each containing point_like
items. Each item in these sublists is interpreted as a Point through which a connecting
line is drawn. Separate sublists thus represent separate drawing lines.
Return type Annot
Returns the created annotation in its default appearance: black = (0, 0, 0), line width 1. No fill
color support.
add_line_annot(p1, p2)
PDF only: Add a line annotation.
Parameters
• p1 (point_like) – the starting point of the line.
• p2 (point_like) – the end point of the line.
Return type Annot
Returns the created annotation. It is drawn with line (stroke) color red = (1, 0, 0) and line
width 1. No fill color support. The annot rectangle is automatically created to contain
both points, each one surrounded by a circle of radius 3 * line width to make room for
any line end symbols.
add_rect_annot(rect)
add_circle_annot(rect)
PDF only: Add a rectangle, resp. circle annotation.
Parameters rect (rect_like) – the rectangle in which the circle or rectangle is drawn,
must be finite and not empty. If the rectangle is not equal-sided, an ellipse is drawn.
Note:
– For an existing font of the page, use its reference name as fontname (this is
item[4] of its entry in Page.get_fonts()).
– For a new, non-builtin font, proceed as follows:
• fontsize (float) – (New in v1.16.12) the fontsize to use for the replacing text.
If the text is too large to fit, several insertion attempts will be made, gradually
reducing the fontsize to no less than 4. If then the text will still not fit, no text
insertion will take place at all.
• align (int) – (New in v1.16.12) the horizontal alignment for the replacing text.
See insert_textbox() for available values. The vertical alignment is (ap-
proximately) centered if a PDF built-in font is used (CJK or PDF Base 14 Fonts).
• fill (sequence) – (New in v1.16.12) the fill color of the rectangle after apply-
ing the redaction. The default is white = (1, 1, 1), which is also taken if None is
specified. (Changed in v1.16.13) To suppress a fill color altogether, specify False.
In this case the rectangle remains transparent.
• text_color (sequence) – (New in v1.16.12) the color of the replacing text.
Default is black = (0, 0, 0).
• cross_out (bool) – (new in v1.17.2) add two diagonal lines to the annotation
rectangle.
Return type Annot
Returns the created annotation. (Changed in v1.17.2) Its standard appearance looks like a
red rectangle (no fill color), optionally showing two diagonal lines. Colors, line width,
dashing, opacity and blend mode can now be set and applied via Annot.update()
like with other annotations.
add_polyline_annot(points)
add_polygon_annot(points)
PDF only: Add an annotation consisting of lines which connect the given points. A Polygon’s first
and last points are automatically connected, which does not happen for a PolyLine. The rectangle is
automatically created as the smallest rectangle containing the points, each one surrounded by a circle of
radius 3 (= 3 * line width). The following shows a ‘PolyLine’ that has been modified with colors and line
ends.
Parameters points (list) – a list of point_like objects.
Return type Annot
Returns the created annotation. It is drawn with line color black, line width 1 and no fill color
(fill color is supported, though). Use methods of Annot to make any changes to achieve something
like this:
Note: search_for() delivers a list of either Rect or Quad objects. Such a list can be directly used as
an argument for these annotation types and will deliver one common annotation for all occurrences of
the search string:
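A minimal sketch of this usage, assuming `page` is a PDF page. The function name is hypothetical, and the highlight annotation is used as one representative of the text marker types.

```python
def highlight_all(page, needle):
    """Create one highlight annotation covering every occurrence of needle."""
    # Ask for quads rather than rectangles, as recommended for text markers.
    quads = page.search_for(needle, quads=True)
    return page.add_highlight_annot(quads)
```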
Note: Obviously, text marker annotations need to know what is the top, the bottom, the left, and the right
side of the area(s) to be marked. If the arguments are quads, this information is given by the sequence of
the quad points. In contrast, a rectangle delivers much less information – this is illustrated by the fact that
4! = 24 different quads can be constructed with the four corners of a rectangle.
Therefore, we strongly recommend using the quads option for text searches, to ensure correct annotations.
A similar consideration applies to marking text spans extracted with the “dict” / “rawdict” options
of Page.get_text(). For more details on how to compute quadrilaterals in this case, see section
“How to Mark Non-horizontal Text” of Collection of Recipes.
Parameters
• quads (rect_like,quad_like,list,tuple) – (Changed in v1.14.20)
the location(s) – rectangle(s) or quad(s) – to be marked. A list or tuple must consist
of rect_like or quad_like items (or even a mixture of either). Every item
must be finite, convex and not empty (as applicable). (Changed in v1.16.14) Set
this parameter to None if you want to use the following arguments.
• start (point_like) – (New in v1.16.14) start text marking at this point. De-
faults to the top-left point of clip.
• stop (point_like) – (New in v1.16.14) stop text marking at this point. De-
faults to the bottom-right point of clip.
• clip (rect_like) – (New in v1.16.14) only consider text lines intersecting this
area. Defaults to the page rectangle.
Return type Annot or (changed in v1.16.14) None
Returns the created annotation. (Changed in v1.16.14) If quads is an empty list, no anno-
tation is created.
Note: Starting with v1.16.14 you can use parameters start, stop and clip to highlight consecutive lines
between the points start and stop. Make use of clip to further reduce the selected line bboxes and thus
deal with e.g. multi-column pages. The following multi-line highlight on a page with three text columns
was created by specifying the two red points and setting clip accordingly.
add_stamp_annot(rect, stamp=0)
PDF only: Add a “rubber stamp” like annotation to e.g. indicate the document’s intended use (“DRAFT”,
“CONFIDENTIAL”, etc.).
Parameters
• rect (rect_like) – rectangle where to place the annotation.
• stamp (int) – id number of the stamp text. For available stamps see Stamp
Annotation Icons.
Note:
• The stamp’s text and its border line will automatically be sized and be put horizontally and vertically
centered in the given rectangle. Annot.rect is automatically calculated to fit the given width
and will usually be smaller than this parameter.
• The font chosen is “Times Bold” and the text will be upper case.
• The appearance can be changed using Annot.set_opacity() and by setting the “stroke” color
(no “fill” color supported).
• This can be used to create watermark images: on a temporary PDF page create a stamp annotation
with a low opacity value, make a pixmap from it with alpha=True (and potentially also rotate it),
discard the temporary PDF page and use the pixmap with insert_image() for your target PDF.
add_widget(widget)
PDF only: Add a PDF Form field (“widget”) to a page. This also turns the PDF into a Form PDF.
Because of the large number of options available for widgets, we have developed a new class
Widget, which contains the possible PDF field attributes. It must be used for both form field creation and
updates.
Parameters widget (Widget) – a Widget object which must have been created upfront.
Returns a widget annotation.
delete_annot(annot)
PDF only: Delete annotation from the page and return the next one.
Changed in version 1.16.6: The removal will now include any bound ‘Popup’ or response annotations and
related objects.
Parameters annot (Annot) – the annotation to be deleted.
Return type Annot
Returns the annotation following the deleted one. Please remember that physical removal
requires saving to a new file with garbage > 0.
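Because the method returns the following annotation, a loop can remove all annotations of a page. This is a sketch with a hypothetical function name; it starts from the page's first annotation (property first_annot) and stops when None is returned.

```python
def delete_all_annots(page):
    """Delete every annotation of the page and return how many were removed."""
    count = 0
    annot = page.first_annot
    while annot is not None:
        # delete_annot() hands back the annotation following the deleted one.
        annot = page.delete_annot(annot)
        count += 1
    return count
```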
delete_widget(widget)
(New in v1.18.4)
PDF only: Delete field from the page and return the next one.
Parameters widget (Widget) – the widget to be deleted.
Return type Widget
Returns the widget following the deleted one. Please remember that physical removal re-
quires saving to a new file with garbage > 0.
apply_redactions(images=PDF_REDACT_IMAGE_PIXELS)
(New in version 1.16.11)
PDF only: Remove all text content contained in any redaction rectangle.
(Changed in v1.16.12) The previous mark parameter is gone. Instead, the respective rectangles are filled
with the individual fill color of each redaction annotation. If a text was given in the annotation, then
insert_textbox() is invoked to insert it, using parameters provided with the redaction.
This method applies and then deletes all redactions from the page.
Parameters images (int) – (new in v1.18.0) how to redact overlapping images. The de-
fault (2) blanks out overlapping pixels. PDF_REDACT_IMAGE_NONE (0) ignores,
and PDF_REDACT_IMAGE_REMOVE (1) completely removes all overlapping im-
ages.
Returns True if at least one redaction annotation has been processed, False otherwise.
Note:
• Text contained in a redaction rectangle will be physically removed from the page (assuming
Document.save() with a suitable garbage option) and will no longer appear in e.g. text ex-
tractions or anywhere else. All redaction annotations will also be removed. Other annotations are
unaffected.
• All overlapping links will be removed. If the rectangle of the link was covering text, then only the
overlapping part of the text is removed. The same applies to images covered by link rectangles.
• (Changed in v1.18.0) The overlapping parts of images will be blanked-out for default option
PDF_REDACT_IMAGE_PIXELS. Option 0 does not touch any images and 1 will remove any im-
age with an overlap. Please be aware that there is a bug for option PDF_REDACT_IMAGE_PIXELS
= 2: transparent images will be incorrectly handled!
• For option images=PDF_REDACT_IMAGE_REMOVE only this page’s references to the images
are removed - not necessarily the images themselves. Images are completely removed from the file
only if they are no longer referenced at all (assuming suitable garbage collection options).
• For option images=PDF_REDACT_IMAGE_PIXELS a new image of format PNG is created,
which the page will use in place of the original one. The original image is not deleted or replaced
as part of this process, so other pages may still show the original. In addition, the new, modified
PNG image currently is stored uncompressed. Do keep these aspects in mind when choosing the
right garbage collection method and compression options during save.
• Text removal is done by character: A character is removed if its bbox has a non-empty over-
lap with a redaction rectangle (changed in MuPDF v1.17). Depending on the font properties
and / or the chosen line height, deletion may occur for undesired text parts. Using Tools.
set_small_glyph_heights() with a True argument before text search may help to prevent
this.
• Redactions are a simple way to replace single words in a PDF, or to just physically remove them.
Locate the word “secret” using some text extraction or search method and insert a redaction using
“xxxxxx” as replacement text for each occurrence.
– Be wary if the replacement is longer than the original – this may lead to an awkward appear-
ance, line breaks or no new text at all.
– For a number of reasons, the new text may not be positioned exactly on the same line as the
old one – this is especially true if the replacement font was not one of CJK or PDF Base 14 Fonts.
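The word replacement described in the last note could be sketched as follows, assuming `page` is a PDF page. The function name is hypothetical; the redaction annotations are created with add_redact_annot() using only its text argument, all other parameters keep their defaults.

```python
def replace_word(page, word, replacement):
    """Redact every occurrence of word, inserting replacement in its place."""
    hits = page.search_for(word)
    for rect in hits:
        page.add_redact_annot(rect, text=replacement)
    if hits:
        # Removes the covered text and deletes the redaction annotations.
        page.apply_redactions()
    return len(hits)
```

The change only becomes permanent after saving the document, see the remarks on garbage collection above.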
delete_link(linkdict)
PDF only: Delete the specified link from the page. The parameter must be an original item of
get_links() (see below). The reason for this is the dictionary’s “xref” key, which identifies the
PDF object to be deleted.
Parameters linkdict (dict) – the link to be deleted.
insert_link(linkdict)
PDF only: Insert a new link on this page. The parameter must be a dictionary of format as provided by
get_links() (see below).
Parameters linkdict (dict) – the link to be inserted.
update_link(linkdict)
PDF only: Modify the specified link. The parameter must be a (modified) original item of
get_links() (see below). The reason for this is the dictionary’s “xref” key, which identifies the
PDF object to be changed.
Parameters linkdict (dict) – the link to be modified.
Warning: If updating / inserting a URI link ("kind": LINK_URI), please make sure to start the
value for the "uri" key with a disambiguating string like "http://", "https://", "file:/
/", "ftp://", "mailto:", etc. Otherwise – depending on your browser or other “consumer”
software – unexpected default assumptions may lead to unwanted behaviours.
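One way to guard against a missing prefix is to check the URI before updating the link. This is a sketch only: `set_link_uri` is a hypothetical helper, and the scheme check via the standard library's urlparse is an assumption about what counts as "disambiguating".

```python
from urllib.parse import urlparse


def set_link_uri(page, link, uri):
    """Point a URI link (an original item of page.get_links()) to a new target."""
    if not urlparse(uri).scheme:
        raise ValueError(f"URI {uri!r} lacks a prefix like 'https://' or 'mailto:'")
    link["uri"] = uri
    page.update_link(link)
```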
get_label()
(New in v1.18.6)
PDF only: Return the label for the page.
Return type str
Returns the label string like “vii” for Roman numbering or “” if not defined.
get_links()
Retrieves all links of a page.
Return type list
Returns A list of dictionaries. For a description of the dictionary entries see below. Always
use this or the Page.links() method if you intend to make changes to the links of
a page.
links(kinds=None)
(New in version 1.16.4)
Return a generator over the page’s links. The results equal the entries of Page.get_links().
Parameters kinds (sequence) – a sequence of integers to down-select to one or more
link kinds. Default is all links. Example: kinds=(fitz.LINK_GOTO,) will only return
internal links.
Return type generator
Returns an entry of Page.get_links() for each iteration.
annots(types=None)
(New in version 1.16.4)
Return a generator over the page’s annotations.
Note: Parameters overlay, keep_proportion, rotate and oc have the same meaning as in Page.
show_pdf_page().
PDF only: Insert text into the specified rect_like rect. See Shape.insert_textbox().
draw_line(p1, p2, color=None, width=1, dashes=None, lineCap=0, lineJoin=0, overlay=True,
morph=None, stroke_opacity=1, fill_opacity=1, oc=0)
(Changed in v1.18.4)
PDF only: Draw a line from p1 to p2 (point_like s). See Shape.draw_line().
draw_zigzag(p1, p2, breadth=2, color=None, width=1, dashes=None, lineCap=0, lineJoin=0, over-
lay=True, morph=None, stroke_opacity=1, fill_opacity=1, oc=0)
(Changed in v1.18.4)
PDF only: Draw a zigzag line from p1 to p2 (point_like s). See Shape.draw_zigzag().
draw_squiggle(p1, p2, breadth=2, color=None, width=1, dashes=None, lineCap=0, lineJoin=0,
overlay=True, morph=None, stroke_opacity=1, fill_opacity=1, oc=0)
(Changed in v1.18.4)
PDF only: Draw a squiggly (wavy, undulated) line from p1 to p2 (point_like s). See Shape.
draw_squiggle().
draw_circle(center, radius, color=None, fill=None, width=1, dashes=None, lineCap=0, line-
Join=0, overlay=True, morph=None, stroke_opacity=1, fill_opacity=1, oc=0)
(Changed in v1.18.4)
PDF only: Draw a circle around center (point_like) with a radius of radius. See Shape.
draw_circle().
draw_oval(quad, color=None, fill=None, width=1, dashes=None, lineCap=0, lineJoin=0, over-
lay=True, morph=None, stroke_opacity=1, fill_opacity=1, oc=0)
(Changed in v1.18.4)
PDF only: Draw an oval (ellipse) within the given rect_like or quad_like. See Shape.
draw_oval().
draw_sector(center, point, angle, color=None, fill=None, width=1, dashes=None, lineCap=0,
lineJoin=0, fullSector=True, overlay=True, closePath=False, morph=None,
stroke_opacity=1, fill_opacity=1, oc=0)
(Changed in v1.18.4)
PDF only: Draw a circular sector, optionally connecting the arc to the circle’s center (like a piece of pie).
See Shape.draw_sector().
draw_polyline(points, color=None, fill=None, width=1, dashes=None, lineCap=0, lineJoin=0,
overlay=True, closePath=False, morph=None, stroke_opacity=1, fill_opacity=1,
oc=0)
(Changed in v1.18.4)
PDF only: Draw several connected lines defined by a sequence of point_like s. See Shape.
draw_polyline().
draw_bezier(p1, p2, p3, p4, color=None, fill=None, width=1, dashes=None, lineCap=0, lineJoin=0,
overlay=True, closePath=False, morph=None, stroke_opacity=1, fill_opacity=1,
oc=0)
(Changed in v1.18.4)
PDF only: Draw a cubic Bézier curve from p1 to p4 with the control points p2 and p3 (all are
point_like s). See Shape.draw_bezier().
draw_curve(p1, p2, p3, color=None, fill=None, width=1, dashes=None, lineCap=0, lineJoin=0,
overlay=True, closePath=False, morph=None, stroke_opacity=1, fill_opacity=1, oc=0)
(Changed in v1.18.4)
PDF only: This is a special case of draw_bezier(). See Shape.draw_curve().
Note: An efficient way to background-color a PDF page with the old Python paper color is
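A background fill of this kind could be sketched as below, assuming `page` is a PDF page. The function name is hypothetical, and the RGB value is an assumed pale green used as a placeholder, not a confirmed definition of the color mentioned in the note.

```python
# Assumed placeholder color (pale green), given as RGB floats in range 0..1.
PAPER = (0.94, 1.0, 0.82)


def paint_background(page, rgb=PAPER):
    """Cover the whole page with a colored rectangle behind existing content."""
    # overlay=False puts the rectangle in the background.
    page.draw_rect(page.rect, color=rgb, fill=rgb, overlay=False)
```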
Note: A reserved fontname can be specified in any mixture of upper or lower case and still match the
right built-in font definition: fontnames “helv”, “Helv”, “HELV”, “Helvetica”, etc. all lead to the same
font definition “Helvetica”. But from a Page perspective, these are different references. You can exploit
this fact when using different encoding variants (Latin, Greek, Cyrillic) of the same font on a page.
Parameters
• fontfile (str) – a path to a font file. If used, fontname must be different from
all reserved names.
• fontbuffer (bytes/bytearray) – the memory image of a font file. If used,
fontname must be different from all reserved names. This parameter would typ-
ically be used with Font.buffer for fonts supported / available via Font.
• set_simple (int) – applicable for fontfile / fontbuffer cases only: enforce treat-
ment as a “simple” font, i.e. one that only uses character codes up to 255.
• encoding (int) – applicable for the “Helvetica”, “Courier” and “Times” sets of
Base14_Fonts only. Select one of the available encodings Latin (0), Cyrillic (2)
or Greek (1). Only use the default (0 = Latin) for “Symbol” and “ZapfDingBats”.
Return type int
Returns the xref of the installed font.
Note: Built-in fonts will not lead to the inclusion of a font file. So the resulting PDF file will remain
small. However, your PDF viewer software is responsible for generating an appropriate appearance – and
there exist differences on whether or how each one of them does this. This is especially true for the CJK
fonts. But also Symbol and ZapfDingbats are incorrectly handled in some cases. Following are the Font
Names and their correspondingly installed Base Font names:
Base-14 Fonts1
Parameters
[1] If your existing code already uses the installed base name as a font reference (as it was supported by PyMuPDF versions earlier than 1.14),
this will continue to work.
[2] Not all PDF reader software (including internet browsers and office software) display all of these fonts. And if they do, the difference between
the serifed and the non-serifed version may hardly be noticeable. But serifed and non-serifed versions lead to different installed base fonts, thus
providing an option to be displayable with your specific PDF viewer.
[3] Not all PDF readers display these fonts at all. Some others do, but use a wrong character spacing, etc.
• rect (rect_like) – where to put the image. Must be finite and not empty.
(Changed in v1.17.6) No longer needs to have a non-empty intersection with the
page’s Page.cropbox 5 .
(Changed in version 1.14.13) The image is now always placed centered in the
rectangle, i.e. the centers of image and rectangle are equal.
• filename (str) – name of an image file (all formats supported by MuPDF – see
Supported Input Image Formats).
• stream (bytes,bytearray,io.BytesIO) – image in memory (all formats
supported by MuPDF – see Supported Input Image Formats).
Changed in version 1.14.13: io.BytesIO is now also supported.
• pixmap (Pixmap) – a pixmap containing the image.
• mask (bytes,bytearray,io.BytesIO) – (new in version v1.18.1) image
in memory – to be used as image mask (alpha values) for the base image. When
specified, the base image must be provided as a filename or a stream – and must
not be an image that already has a mask.
• xref (int) – (New in v1.18.13) the xref of an image already present in the
PDF. If given, parameters filename, pixmap, stream, alpha and mask are
ignored. The page will simply receive a reference to the existing image.
• alpha (int) – (Changed in v1.19.3) deprecated. No longer needed – ignored
when given.
• rotate (int) – (new in version v1.14.11) rotate the image. Must be an integer
multiple of 90 degrees. If you need a rotation by an arbitrary angle, consider con-
verting the image to a PDF (Document.convert_to_pdf()) first and then
use Page.show_pdf_page() instead.
• oc (int) – (new in v1.18.3) (xref) make image visibility dependent on this OCG
or OCMD. Ignored after the first of multiple insertions. The property is stored
with the generated PDF image object and therefore controls the image’s visibil-
ity throughout the PDF.
• keep_proportion (bool) – (new in version v1.14.11) maintain the aspect ra-
tio of the image.
being invisible or only partially visible if the cropbox (representing the visible page part) is smaller.
Note:
1. The method detects multiple insertions of the same image (like in above example) and will store its
data only on the first execution. This is even true, if using the default xref=0.
2. The method cannot detect if the same image had already been part of the file before opening it.
3. You can use this method to provide a background or foreground image for the page, like a copyright
or a watermark. Please remember, that watermarks require a transparent image if put in foreground
...
4. The image may be inserted uncompressed, e.g. if a Pixmap is used or if the image has an alpha
channel. Therefore, consider using deflate=True when saving the file. In addition, there exist
effective ways to control the image size – even if transparency comes into play. Have a look at this
section of the documentation.
5. The image is stored in the PDF in its original quality. This may be much better than what you
ever need for your display. Consider decreasing the image size before insertion – e.g. by using
the pixmap option and then shrinking it or scaling it down (see Pixmap chapter). The PIL method
Image.thumbnail() can also be used for that purpose. The file size savings can be very significant.
6. Another efficient way to display the same image on multiple pages is another method:
show_pdf_page(). Consult Document.convert_to_pdf() for how to obtain intermedi-
ary PDFs usable for that method. Demo script fitz-logo.py implements a fairly complete approach.
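The placement rule described for rect and keep_proportion (aspect ratio kept, centers of image and rectangle equal) can be illustrated with a few lines of plain Python. This is only a sketch of the geometry under those stated assumptions, not the library's code: the helper name fit_centered is hypothetical, rectangles are plain 4-tuples, and rotation is ignored.

```python
def fit_centered(rect, img_width, img_height):
    """Return the largest rectangle with the image's aspect ratio, centered in rect."""
    x0, y0, x1, y1 = rect
    box_w, box_h = x1 - x0, y1 - y0
    if box_w <= 0 or box_h <= 0:
        raise ValueError("rect must be finite and not empty")
    # one common scale factor for both directions keeps the aspect ratio
    scale = min(box_w / img_width, box_h / img_height)
    w, h = img_width * scale, img_height * scale
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# a square image in a 200 x 100 target rectangle
print(fit_centered((0, 0, 200, 100), 50, 50))  # (50.0, 0.0, 150.0, 100.0)
```

A helper like this may be handy for predicting which part of the target rectangle an inserted image will cover.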
Parameters
• opt (str) – A string indicating the requested format, one of the above. A mixture
of upper and lower case is supported.
Changed in version 1.16.3 Values “words” and “blocks” are now also accepted.
• clip (rect-like) – (new in v1.17.7) restrict extracted text to this rectangle. If
None, the full page is taken. Has no effect for options “html”, “xhtml” and “xml”.
• flags (int) – (new in version 1.16.2) indicator bits to control whether to in-
clude images or how text should be handled with respect to white spaces and
ligatures. See Text Extraction Flags for available indicators and Text Extrac-
tion Flags Defaults for default settings.
• textpage – (new in v1.19.0) use a previously created TextPage. This reduces
execution time very significantly: by more than 50% and up to 95%, depending
on the extraction option. If specified, the ‘flags’ and ‘clip’ arguments are ignored,
because they are textpage only properties. If omitted, a new, temporary textpage
will be created.
• sort (bool) – (new in v1.19.1) sort the output by vertical, then horizontal coordi-
nates. In many cases, this should suffice to generate a “natural” reading order. Has
no effect on (X)HTML and XML. Output option “words” sorts by (y1, x0) of
the words’ bboxes. Similar is true for “blocks”, “dict”, “json”, “rawdict”, “rawj-
son”: they all are sorted by (y1, x0) of the resp. block bbox. If specified for
“text”, then internally “blocks” is used.
Return type str, list, dict
Returns The page’s content as a string, a list or a dictionary. Refer to the corresponding
TextPage method for details.
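To make the effect of sort concrete, here is a small sketch of ordering by (y1, x0) in plain Python. The sample tuples are hand-made and only assumed to have the layout of “words” output, i.e. (x0, y0, x1, y1, "word", block_no, line_no, word_no); with a real page they would come from page.get_text("words").

```python
# Hand-made sample data in the assumed layout of page.get_text("words") items
words = [
    (200.0, 100.0, 240.0, 112.0, "world", 1, 0, 0),
    (72.0, 130.0, 110.0, 142.0, "second", 2, 0, 0),
    (72.0, 100.0, 100.0, 112.0, "Hello", 0, 0, 0),
]

# sort by bottom coordinate first (y1), then by left coordinate (x0)
ordered = sorted(words, key=lambda w: (w[3], w[0]))
text = " ".join(w[4] for w in ordered)
print(text)  # Hello world second
```

Words whose bottom coordinates differ only slightly (superscripts, mixed font sizes) may still end up in an unexpected order, so rounding y1 before sorting can be worth trying.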
Note:
1. You can use this method as a document conversion tool from any supported document type (not
only PDF!) to one of TEXT, HTML, XHTML or XML documents.
2. The inclusion of text via the clip parameter is decided on a by-character level: (changed in v1.18.2)
a character becomes part of the output, if its bbox is contained in clip. This deviates from the
algorithm used in redaction annotations: a character will be removed if its bbox intersects any
redaction annotation.
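The difference between the two rules in note 2 can be shown with plain tuples. The following sketch assumes rectangles given as (x0, y0, x1, y1); the helper names contains and intersects are hypothetical and merely restate “bbox is contained in clip” versus “bbox intersects the redaction rectangle”.

```python
def contains(outer, inner):
    """True if rectangle inner lies completely inside rectangle outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def intersects(a, b):
    """True if rectangles a and b share some area."""
    return max(a[0], b[0]) < min(a[2], b[2]) and max(a[1], b[1]) < min(a[3], b[3])


area = (100, 100, 200, 120)        # used as clip or as redaction rectangle
inside = (110, 102, 118, 118)      # character bbox fully inside the area
straddling = (196, 102, 204, 118)  # character bbox crossing the right edge

extracted = [contains(area, c) for c in (inside, straddling)]
redacted = [intersects(area, c) for c in (inside, straddling)]
print(extracted, redacted)  # [True, False] [True, True]
```

So a character that straddles the border is left out of a clipped extraction, but would be removed by a redaction over the same rectangle.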
get_textbox(rect, textpage=None)
• New in v1.17.7
• Changed in v1.19.0: add textpage parameter
Retrieve the text contained in a rectangle.
Parameters
• rect (rect-like) – rect-like.
• textpage – a TextPage to use. If omitted, a new, temporary textpage will be
created.
Returns
a string with interspersed linebreaks where necessary. Changed in v1.19.0: It is based
on dedicated code. A typical use is checking the result of Page.search_for():
>>> rl = page.search_for("currency:")
>>> page.get_textbox(rl[0])
'Currency:'
>>>
get_textpage(clip=None, flags=3)
(New in version 1.16.5)
Create a TextPage for the page.
Parameters
• flags (int) – indicator bits controlling the content available for subsequent text
extractions and searches – see the parameter of Page.get_text().
• clip (rect-like) – (new in v1.17.7) restrict extracted text to this area.
Returns TextPage
get_textpage_ocr(flags=3, language="eng", dpi=72, full=False)
• New in v1.19.0
• Changed in v1.19.1: support full and partial OCRing a page.
Create a TextPage for the page that includes OCRed text. MuPDF will invoke Tesseract-OCR if this
method is used. Otherwise this is a normal TextPage object.
Parameters
• flags (int) – indicator bits controlling the content available for subsequent text
extractions and searches – see the parameter of Page.get_text().
• language (str) – the expected language(s). Use “+”-separated values if multi-
ple languages are expected, “eng+spa” for English and Spanish.
• dpi (int) – the desired resolution in dots per inch. Influences recognition quality
(and execution time).
• full (bool) – whether to OCR the full page, or just the displayed images.
Note: This method does not support a clip parameter – OCR will always happen for the complete page
rectangle.
Returns
a TextPage. Execution may take significantly longer than Page.get_textpage().
For a full page OCR, all text will have the font “GlyphlessFont” from Tesseract. In case
of partial OCR, normal text will keep its properties, and only text coming from images
will have the GlyphlessFont.
Note: OCRed text is only available to PyMuPDF’s text extractions and searches if
their textpage parameter specifies the output of this method.
This Jupyter notebook walks through an example for using OCR textpages.
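Since OCRed text is only visible to extractions that receive the OCR textpage, a typical pattern is to create it once and pass it to every subsequent call. The sketch below is written against the method signatures documented above; the helper name ocr_extract is hypothetical, and running it on a real page assumes that Tesseract-OCR and its language data are installed.

```python
def ocr_extract(page, language="eng", dpi=72, full=False):
    """OCR the page once, then reuse the textpage for several extractions."""
    tp = page.get_textpage_ocr(flags=3, language=language, dpi=dpi, full=full)
    # both calls reuse the same textpage, so OCR is not repeated
    text = page.get_text("text", textpage=tp)
    words = page.get_text("words", textpage=tp)
    return text, words


# Possible use with a real document (the file name is a placeholder):
#   import fitz
#   page = fitz.open("scan.pdf")[0]
#   text, words = ocr_extract(page, language="eng+spa", dpi=150, full=True)
```

A higher dpi value will usually improve recognition at the cost of a longer run time, so it may pay off to experiment with a few values.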
get_drawings()
• New in v1.18.0
• Changed in v1.18.17
• Changed in v1.19.0: add “seqno” key, remove “clippings” key
• Changed in v1.19.1: “color” / “fill” keys are now always either RGB tuples or None. This
resolves issues caused by exotic colorspaces.
• Changed in v1.19.2: add an indicator for the “orientation” of the area covered by an “re” item.
Return the draw commands of the page. These are instructions which draw lines, rectangles, quads
or curves, including properties like colors, transparency, line width and dashing, etc.
Returns a list of dictionaries. Each dictionary item contains one or more single draw com-
mands belonging together: they have the same properties (colors, dashing, etc.). This is
called a “path” in PDF, but the method works for all document types.
The path dictionary has been designed to be compatible with class Shape. There are the following keys:
Key             Value
closePath       Same as the parameter in Shape.
color           Stroke color (see Shape).
dashes          Dashed line specification (see Shape).
even_odd        Fill colors of area overlaps – same as the parameter in Shape.
fill            Fill color (see Shape).
items           List of draw commands: lines, rectangles, quads or curves.
lineCap         Number 3-tuple, use its max value on output with Shape.
lineJoin        Same as the parameter in Shape.
fill_opacity    (new in v1.18.17) fill color transparency (see Shape).
stroke_opacity  (new in v1.18.17) stroke color transparency (see Shape).
rect            Page area covered by this path. Information only.
seqno           (new in v1.19.0) command number when building page appearance
type            (new in v1.18.17) type of this path.
width           Stroke line width (see Shape).
• (Changed in v1.18.17) Key "opacity" has been replaced by the new keys
"fill_opacity" and "stroke_opacity". This is now compatible with the
corresponding parameters of Shape.finish().
orientation, which is 1 or -1, indicating whether the enclosed area is rotated left
(1 = anti-clockwise) or right (-1 = clockwise) 7.
• ("qu", quad) - a Quad. New in v1.18.17, changed in v1.19.2: 3 or 4 consecutive
lines are detected to actually represent a Quad.
Note: Starting with v1.19.2, quads and rectangles are reliably recognized as such.
Using class Shape, you should be able to recreate the original drawings on a separate (PDF)
page with high fidelity, but see the following comments on restrictions. A coding draft can
be found in section “Extracting Drawings” of chapter Collection of Recipes.
Note:
• The visual appearance of a page may have been designed in a very complex way. For example
in PDF, layers (Optional Content Groups) can control the visibility of items (drawings and other
objects) depending on whatever condition: for example showing or suppressing a watermark de-
pending on the current output device (screen, paper, . . . ), or option-based inclusion / omission of
details in a technical document, and so on. Effects like these are ignored by the method – it will
unconditionally return all paths.
• When a viewer software builds a page’s appearance, it will sequentially walk through a list of
commands (in PDF, those are stored in the /Contents object), containing instructions like “draw
this path, show this image, paint this text, etc.”. The key "seqno" (new in v1.19.0) is the command
number, that draws this path. You can use it to determine if objects cover other objects on the
page. For example, the rectangle of a “fill” path will cover objects drawn earlier – i.e. having
a smaller "seqno" – if the rectangles overlap. Please also see Page.get_bboxlog() and
Page.get_texttrace().
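The "seqno" rule from the second bullet can be turned into a small check. The following sketch works on hand-made path dictionaries that only carry the keys needed here; rectangles are plain tuples as delivered by Page.get_cdrawings(), and the helper names are hypothetical. Overlap alone does not prove that an object is fully hidden (opacity and the actual shape matter), so the result is best read as a list of candidates.

```python
def rects_overlap(a, b):
    """True if rectangles a and b share some area."""
    return max(a[0], b[0]) < min(a[2], b[2]) and max(a[1], b[1]) < min(a[3], b[3])


def covered_by(paths, target):
    """Return the seqno of every filled path drawn after target that overlaps it."""
    return [
        p["seqno"]
        for p in paths
        if p.get("fill") is not None          # stroke-only paths do not cover anything
        and p["seqno"] > target["seqno"]      # drawn later than the target
        and rects_overlap(p["rect"], target["rect"])
    ]


paths = [
    {"seqno": 0, "rect": (50, 50, 150, 150), "fill": None, "color": (0, 0, 0)},
    {"seqno": 1, "rect": (100, 100, 200, 200), "fill": (1, 1, 1), "color": None},
    {"seqno": 2, "rect": (300, 300, 400, 400), "fill": (1, 0, 0), "color": None},
]
print(covered_by(paths, paths[0]))  # [1]
```

Using p.get("fill") instead of p["fill"] keeps the helper usable with get_cdrawings() output, where irrelevant keys are omitted.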
Note: The method is now based on the output of Page.get_cdrawings() – which is faster, but
requires somewhat more attention processing its output.
get_cdrawings()
• New in v1.18.17
• Changed in v1.19.0: removed “clippings” key, added “seqno” key.
• Changed in v1.19.1: always generate RGB color tuples.
Extract the drawing paths on the page. Apart from the following technical differences, this method is functionally
equivalent to Page.get_drawings(), but much faster (factor 3 or more):
• Every path type only contains the relevant keys, e.g. a stroke path has no "fill" color key. See
comment in method Page.get_drawings().
• Coordinates are given as point_like, rect_like and quad_like tuples – not as Point,
Rect, Quad objects.
Note: If performance is a concern (e.g. because your page has tens of thousands of drawings), consider
using this method: Compared to versions earlier than 1.18.17, you should see much shorter response
times. We have seen pages that required 2 seconds then, now only need 200 ms with this method.
7 In PDF, an area enclosed by some lines or curves can have a property called “orientation”. This is significant for switching on or off the fill
color of that area when there exist multiple area overlaps - see discussion in method Shape.finish() using the “non-zero winding number”
rule. While orientation of curves, quads, triangles and other shapes enclosed by lines always was detectable, this has been impossible for “re”
(rectangle) items in the past. Adding the orientation parameter now delivers the missing information.
get_fonts(full=False)
PDF only: Return a list of fonts referenced by the page. Wrapper for Document.
get_page_fonts().
get_images(full=False)
PDF only: Return a list of images referenced by the page. Wrapper for Document.
get_page_images().
get_image_info(hashes=False, xrefs=False)
• New in v1.18.11
• Changed in v1.18.13: added image MD5 hashcode computation and xref search.
Return a list of meta information dictionaries for all images shown on the page. This works for all
document types. Technically, this is a subset of the dictionary output of Page.get_text(): the image
binary content and any text on the page are ignored.
Parameters
• hashes (bool) – New in v1.18.13: Compute the MD5 hashcode for each en-
countered image, which allows identifying image duplicates. This adds the key
"digest" to the output, whose value is a 16 byte bytes object.
• xrefs (bool) – New in v1.18.13: PDF only. Try to find the xref for each
image. Implies hashes=True. Adds the "xref" key to the dictionary. If not
found, the value is 0, which means, the image is either “inline” or otherwise unde-
tectable. Please note that this option has an extended response time, because the
MD5 hashcode will be computed at least two times for each image with an xref.
Return type list[dict]
Returns
A list of dictionaries. This includes information for exactly those images, that are
shown on the page – including “inline images”. In contrast to images included in
Page.get_text(), image binary content is not loaded, which drastically reduces
memory usage. The dictionary layout is similar to that of image blocks in page.
get_text("dict").
Key         Value
number      block number (int)
bbox        image bbox on page, rect_like
width       original image width (int)
height      original image height (int)
cs-name     colorspace name (str)
colorspace  colorspace.n (int)
xres        resolution in x-direction (int)
yres        resolution in y-direction (int)
bpc         bits per component (int)
size        storage occupied by image (int)
digest      MD5 hashcode (bytes), if hashes is true
xref        image xref or 0, if xrefs is true
transform   matrix transforming image rect to bbox, matrix_like
Multiple occurrences of the same image are always reported. You can detect duplicates
by comparing their digest values.
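Comparing digest values could look like the following sketch. It groups entries by their "digest" key and keeps only digests that occur more than once. The sample dictionaries are hand-made and reduced to the keys used here; with a real page, the list would come from page.get_image_info(hashes=True). The helper name group_duplicates is hypothetical.

```python
from collections import defaultdict


def group_duplicates(image_infos):
    """Map each digest that occurs more than once to the bboxes where it is shown."""
    by_digest = defaultdict(list)
    for info in image_infos:
        by_digest[info["digest"]].append(info["bbox"])
    return {d: boxes for d, boxes in by_digest.items() if len(boxes) > 1}


infos = [
    {"number": 0, "bbox": (0, 0, 100, 100), "digest": b"\x01" * 16},
    {"number": 1, "bbox": (0, 200, 100, 300), "digest": b"\x01" * 16},
    {"number": 2, "bbox": (300, 0, 400, 100), "digest": b"\x02" * 16},
]
dups = group_duplicates(infos)
print(len(dups))  # 1
```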
get_xobjects()
PDF only: Return a list of Form XObjects referenced by the page. Wrapper for Document.
get_page_xobjects().
get_image_rects(item, transform=False)
New in v1.18.13
PDF only: Return boundary boxes and transformation matrices of an embedded image. This is an im-
proved version of Page.get_image_bbox() with the following differences:
• There is no restriction on how the image is invoked (by the page or one of its Form XObjects). The
result is always complete and correct.
• The result is a list of Rect or (Rect, Matrix) objects – depending on transform. Each list item
represents one location of the image on the page. Multiple occurrences might not be detectable by
Page.get_image_bbox().
• The method invokes Page.get_image_info() with xrefs=True and therefore has a no-
ticeably longer response time than Page.get_image_bbox().
Parameters
• item (list,str,int) – an item of the list Page.get_images(), or the
reference name entry of such an item (item[7]), or the image xref.
• transform (bool) – also return the matrix used to transform the image rectan-
gle to the bbox on the page. If true, then tuples (bbox, matrix) are returned.
Return type list
Returns Boundary boxes and respective transformation matrices for each image occurrence
on the page. If the item is not on the page, an empty list [] is returned.
get_image_bbox(item, transform=False)
Changed in v1.18.11
PDF only: Return boundary box and transformation matrix of an embedded image.
Changed in v1.17.0:
• The page’s contents are no longer modified by this method.
Parameters
• item (list,str) – an item of the list Page.get_images() with full=True
specified, or the reference name entry of such an item, which is item[-3] (or item[7]
respectively).
• transform (bool) – (new in v1.18.11) also return the matrix used to transform
the image rectangle to the bbox on the page. Default is just the bbox. If true, then
a tuple (bbox, matrix) is returned.
Return type Rect or (Rect, Matrix)
Returns
the boundary box of the image – optionally also its transformation matrix.
• (Changed in v1.16.7) – If the page in fact does not display this image, an infinite
rectangle is returned now. In previous versions, an exception was raised. Formally
invalid parameters still raise exceptions.
• (Changed in v1.17.0) – Only images referenced directly by the page are considered.
This means that images occurring in embedded PDF pages are ignored and an
exception is raised.
• (Changed in v1.18.5) – Removed the restriction introduced in v1.17.0: any item of
the page’s image list may be specified.
• (Changed in v1.18.11) – Partially re-instated a restriction: only those images are
considered, that are either directly referenced by the page or by a Form XObject
directly referenced by the page.
• (Changed in v1.18.11) – Optionally also return the transformation matrix together
with the bbox as the tuple (bbox, transform).
Note:
1. Be aware that Page.get_images() may contain “dead” entries i.e. images, which the page
does not display. This is no error, but intended by the PDF creator. No exception will be raised
in this case, but an infinite rectangle is returned. You can avoid this from happening by executing
Page.clean_contents() before this method.
2. The image’s “transformation matrix” is defined as the matrix, for which the expression bbox /
transform == fitz.Rect(0, 0, 1, 1) is true; for details see Image Transformation Matrix.
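The definition in note 2 is equivalent to saying that the unit rectangle, multiplied by the matrix, yields the bbox. The sketch below checks this in plain Python for an unrotated image. It assumes that a matrix is a 6-tuple (a, b, c, d, e, f) mapping a point (x, y) to (a*x + c*y + e, b*x + d*y + f); the helper names and the sample numbers are made up for illustration.

```python
def apply(matrix, point):
    """Transform a point with a matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


def transform_rect(matrix, rect):
    """Transform the four corners of rect and return their bounding box."""
    x0, y0, x1, y1 = rect
    pts = [apply(matrix, p) for p in ((x0, y0), (x1, y0), (x0, y1), (x1, y1))]
    xs, ys = zip(*pts)
    return (min(xs), min(ys), max(xs), max(ys))


matrix = (200, 0, 0, 300, 100, 200)  # scale to 200 x 300, then shift to (100, 200)
bbox = transform_rect(matrix, (0, 0, 1, 1))
print(bbox)  # (100, 200, 300, 500)
```

For a rotated or skewed image the values b and c are non-zero, and the bbox is then only the hull of the transformed unit rectangle.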
get_svg_image(matrix=fitz.Identity, text_as_path=True)
Create an SVG image from the page. Only full page images are currently supported.
Parameters
• matrix (matrix_like) – a matrix, default is Identity.
• text_as_path (bool) – (new in v1.17.5) – controls how text is represented.
True outputs each character as a series of elementary draw commands, which leads
to a more precise text display in browsers, but a very much larger output for text-
oriented pages. Display quality for False relies on the presence of the referenced
fonts on the current system. For missing fonts, the internet browser will fall back
to some default – leading to unpleasant appearances. Choose False if you want to
parse the text of the SVG.
Returns a UTF-8 encoded string that contains the image. Because SVG has XML syntax it
can be saved in a text file with extension .svg.
• dpi (int) – (new in v1.19.2) desired resolution in x and y direction. If not None,
the "matrix" parameter is ignored.
• colorspace (str or Colorspace) – The desired colorspace, one of “GRAY”,
“RGB” or “CMYK” (case insensitive). Or specify a Colorspace, i.e. one of the
predefined ones: csGRAY, csRGB or csCMYK.
• clip (irect_like) – restrict rendering to the intersection of this area with the
page’s rectangle.
• alpha (bool) – whether to add an alpha channel. Always accept the default False
if you do not really need transparency. This will save a lot of memory (25% in case
of RGB . . . and pixmaps are typically large!), and also processing time. Also
note an important difference in how the image will be rendered: with True the
pixmap’s samples area will be pre-cleared with 0x00. This results in transparent
areas where the page is empty. With False the pixmap’s samples will be pre-cleared
with 0xff. This results in white where the page has nothing to show.
Changed in version 1.14.17 The default alpha value is now False.
– Generated with alpha=True
Note: The method will respect any page rotation and will not exceed the intersection of clip and
Page.cropbox. If you need the page’s mediabox (and if this is a different rectangle), you can use a
snippet like the following to achieve this:
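One way to do this is to temporarily make the cropbox equal to the mediabox, remove the rotation, render, and then restore the page. The helper below is a sketch of that idea using only the page methods and properties described in this chapter; the name render_mediabox is hypothetical and the code is not taken from the library.

```python
def render_mediabox(page, **kwargs):
    """Render the full mediabox of the unrotated page, then restore the page."""
    rotation = page.rotation
    cropbox = tuple(page.cropbox)      # copy the values before changing anything
    page.set_cropbox(page.mediabox)    # make the whole mediabox visible
    page.set_rotation(0)
    try:
        return page.get_pixmap(**kwargs)
    finally:
        # restore the original state even if rendering fails
        page.set_cropbox(cropbox)
        page.set_rotation(rotation)


# Possible use (doc is an open PDF document):
#   pix = render_mediabox(doc[0], dpi=150)
```

Keyword arguments such as dpi, colorspace or alpha are passed through to get_pixmap() unchanged.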
annot_names()
(New in version 1.16.10)
PDF only: return a list of the names of annotations, widgets and links. Technically, these are the /NM
values of every PDF object found in the page’s /Annots array.
Return type list
annot_xrefs()
(New in version 1.17.1)
PDF only: return a list of the xref numbers of annotations, widgets and links – technically of all
entries found in the page’s /Annots array.
Return type list
Returns a list of items (xref, type) where type is the annotation type. Use the type to tell
apart links, fields and annotations, see Annotation Types.
load_annot(ident)
(New in version 1.17.1)
PDF only: return the annotation identified by ident. This may be its unique name (PDF /NM key), or its
xref.
Parameters ident (str,int) – the annotation name or xref.
Return type Annot
Returns the annotation or None.
load_links()
Return the first link on a page. Synonym of property first_link.
Return type Link
Returns first link on the page (or None).
set_rotation(rotate)
PDF only: Sets the rotation of the page.
Parameters rotate (int) – An integer specifying the required rotation in degrees. Must
be an integer multiple of 90. Values will be converted to one of 0, 90, 180, 270.
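The conversion to one of 0, 90, 180, 270 can be pictured as a modulo operation. The following sketch is an assumption about the arithmetic, not the library's code; in particular, raising an exception for values that are not multiples of 90 is a choice made for this example only.

```python
def normalize_rotation(rotate):
    """Map any integer multiple of 90 to one of 0, 90, 180, 270."""
    if rotate % 90 != 0:
        raise ValueError("rotation must be an integer multiple of 90")
    return rotate % 360  # Python's % never returns a negative value here


print([normalize_rotation(r) for r in (0, 90, -90, 450, 720)])  # [0, 90, 270, 90, 0]
```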
show_pdf_page(rect, docsrc, pno=0, keep_proportion=True, overlay=True, oc=0, rotate=0,
clip=None)
PDF only: Display a page of another PDF as a vector image (otherwise similar to Page.
insert_image()). This is a multi-purpose method. For example, you can use it to
• create “n-up” versions of existing PDF files, combining several input pages into one output page
(see example 4-up.py),
• create “posterized” PDF files, i.e. every input page is split up in parts which each create a separate
output page (see posterize.py),
• include PDF-based vector images like company logos, watermarks, etc., see svg-logo.py, which
puts an SVG-based logo on each page (requires additional packages to deal with SVG-to-PDF
conversions).
Parameters
• rect (rect_like) – where to place the image on current page. Must be finite
and its intersection with the page must not be empty.
Changed in version 1.14.11 Position the source rectangle centered in this rectan-
gle.
• docsrc (Document) – source PDF document containing the page. Must be a
different document object, but may be the same file.
• pno (int) – page number (0-based, -∞ < pno < docsrc.page_count) to be shown.
• keep_proportion (bool) – whether to maintain the width-height-ratio (de-
fault). If false, all 4 corners are always positioned on the border of the target
rectangle – whatever the rotation value. In general, this will deliver distorted and
/or non-rectangular images.
• overlay (bool) – put image in foreground (default) or background.
• oc (int) – (new in v1.18.3) (xref) make visibility dependent on this OCG (op-
tional content group).
• rotate (float) – (new in version 1.14.10) show the source rectangle rotated by
some angle. Changed in version 1.14.11: Any angle is now supported.
• clip (rect_like) – choose which part of the source page to show. Default is
the full page, else must be finite and its intersection with the source page must not
be empty.
Note: In contrast to method Document.insert_pdf(), this method does not copy annotations,
widgets or links, so these are not included in the target 6. But all its other resources (text, images,
fonts, etc.) will be imported into the current PDF. They will therefore appear in text extractions and in
get_fonts() and get_images() lists – even if they are not contained in the visible area given by
clip.
6 If you need to also see annotations or fields in the target page, you can try and convert the source PDF to another PDF using
Document.convert_to_pdf(). The underlying MuPDF function of that method will convert these objects to normal page content. Then use
Page.show_pdf_page() with the converted PDF page.
Example: Show the same source page, rotated by 90 and by -90 degrees:
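A sketch of such a session is given below as a small helper, so that it can be tried with any target page and source document. It only relies on Page.rect and on the show_pdf_page() signature shown above; the helper name show_rotated_pair and the file names in the comments are placeholders.

```python
def show_rotated_pair(page, src, pno=0):
    """Show page pno of src twice: upper half rotated by 90, lower half by -90 degrees."""
    width, height = page.rect.width, page.rect.height
    upper = (0, 0, width, height / 2)       # upper half of the target page
    lower = (0, height / 2, width, height)  # lower half of the target page
    page.show_pdf_page(upper, src, pno, rotate=90)
    page.show_pdf_page(lower, src, pno, rotate=-90)
    return upper, lower


# Possible use:
#   import fitz
#   doc = fitz.open()              # new empty PDF
#   page = doc.new_page()          # new page, A4 by default
#   src = fitz.open("source.pdf")  # must be a different document object
#   show_rotated_pair(page, src)
#   doc.save("show.pdf")
```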
new_shape()
PDF only: Create a new Shape object for the page.
Return type Shape
Returns a new Shape to use for compound drawings. See description there.
search_for(needle, clip=None, quads=False, flags=TEXT_DEHYPHENATE |
TEXT_PRESERVE_WHITESPACE | TEXT_PRESERVE_LIGATURES, textpage=None)
• Changed in v1.18.2
Note: The method supports multi-line text marker annotations: you can use the full returned list as one
single parameter for creating the annotation.
Caution:
• There is a tricky aspect: the search logic regards contiguous multiple occurrences of needle
as one: assuming needle is “abc”, and the page contains “abc” and “abcabc”, then only two
rectangles will be returned, one for “abc”, and a second one for “abcabc”.
• You can always use Page.get_textbox() to check what text actually is being surrounded
by each rectangle.
Note: A feature repeatedly asked for is supporting regular expressions when specifying the "needle"
string: There is no way to do this. If you need something in that direction, first extract text in a suitable
format and then subselect the result by matching its text portions with some regex pattern:
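A minimal sketch of that approach follows. To keep it runnable without a document, the word list is hand-made in the assumed layout of “words” output, (x0, y0, x1, y1, "word", block_no, line_no, word_no); with a real page, words = page.get_text("words") would take its place. The pattern is an arbitrary example.

```python
import re

# Hand-made sample data; normally: words = page.get_text("words")
words = [
    (72.0, 100.0, 110.0, 112.0, "Invoice", 0, 0, 0),
    (115.0, 100.0, 160.0, 112.0, "2021-12-24", 0, 0, 1),
    (72.0, 120.0, 100.0, 132.0, "Total:", 1, 0, 0),
]

pattern = re.compile(r"\d{4}-\d{2}-\d{2}")  # example pattern: ISO dates
matches = [w for w in words if pattern.search(w[4])]
```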
The matches list will contain the words matching the regex pattern.
set_mediabox(r)
PDF only: (New in v1.16.13) Change the physical page dimension by setting mediabox in the page’s
object definition.
Parameters r (rect-like) – the new mediabox value.
Note: This method also sets the page’s cropbox to the same value – to prevent mismatches caused by
values further up in the parent hierarchy.
Caution: For non-empty pages this may have undesired effects, because content depends on this
value and will change position or even disappear.
set_cropbox(r)
PDF only: change the visible part of the page.
Parameters r (rect_like) – the new visible area of the page. Note that this must be
specified in unrotated coordinates.
After execution if the page is not rotated, Page.rect will equal this rectangle, but shifted to the top-left
position (0, 0) if necessary. Example session:
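The relationship between the new cropbox and the resulting Page.rect can be sketched in plain Python for an unrotated page. The helper name rect_after_cropbox is hypothetical and the code only restates the rule given above; it does not call PyMuPDF.

```python
def rect_after_cropbox(cropbox):
    """Expected Page.rect of an unrotated page once this cropbox has been set."""
    x0, y0, x1, y1 = cropbox
    # same size as the cropbox, but shifted to the top-left position (0, 0)
    return (0, 0, x1 - x0, y1 - y0)


print(rect_after_cropbox((100, 100, 400, 400)))  # (0, 0, 300, 300)

# With a real page one would expect, correspondingly:
#   page.set_cropbox(fitz.Rect(100, 100, 400, 400))
#   page.rect  ->  a rectangle equal to Rect(0, 0, 300, 300)
```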
rotation
Contains the rotation of the page in degrees (always 0 for non-PDF types).
Type int
cropbox_position
Contains the top-left point of the page’s /CropBox for a PDF, otherwise Point(0, 0).
Type Point
cropbox
The page’s /CropBox for a PDF. Always the unrotated page rectangle is returned. For a non-PDF this
will always equal the page rectangle.
Note: In PDF, the relationship between /MediaBox, /CropBox and page rectangle may sometimes
be confusing; please look up MediaBox in the glossary.
Type Rect
mediabox_size
Contains the width and height of the page’s Page.mediabox for a PDF, otherwise the bottom-right
coordinates of Page.rect.
Type Point
mediabox
The page’s mediabox for a PDF, otherwise Page.rect.
Type Rect
Note: For most PDF documents and for all other document types, page.rect == page.cropbox ==
page.mediabox is true. However, for some PDFs the visible page is a true subset of mediabox. Also, if
the page is rotated, its Page.rect may not equal Page.cropbox. In these cases the above attributes
help to correctly locate page elements.
transformation_matrix
This matrix translates coordinates from the PDF space to the MuPDF space. For example, in PDF /
Rect [x0 y0 x1 y1] the pair (x0, y0) specifies the bottom-left point of the rectangle – in contrast
to MuPDF’s system, where (x0, y0) specifies top-left. Multiplying the PDF coordinates with this matrix
will deliver the (Py-) MuPDF rectangle version. Obviously, the inverse matrix will again yield the PDF
rectangle.
Type Matrix
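For the simplest case, the effect of this matrix is a flip of the y-axis. The sketch below assumes an unrotated page whose mediabox starts at (0, 0) and equals the cropbox, so that only the page height matters; the helper name flip_y is hypothetical, and other page setups need the actual matrix.

```python
def flip_y(rect, page_height):
    """Convert a rectangle between bottom-left based and top-left based coordinates."""
    x0, y0, x1, y1 = rect
    # the same formula works in both directions
    return (x0, page_height - y1, x1, page_height - y0)


pdf_rect = (100, 700, 200, 750)      # PDF /Rect on an A4 page (height 842)
mupdf_rect = flip_y(pdf_rect, 842)
print(mupdf_rect)                    # (100, 92, 200, 142)
print(flip_y(mupdf_rect, 842))       # (100, 700, 200, 750)
```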
rotation_matrix
derotation_matrix
These matrices may be used for dealing with rotated PDF pages. When adding / inserting anything to
a PDF page, the coordinates of the unrotated page are always used. These matrices help translating
between the two states. Example: if a page is rotated by 90 degrees – what would then be the coordinates
of the top-left Point(0, 0) of an A4 page?
Type Matrix
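The question can be answered by hand with a little matrix arithmetic. The sketch below is plain Python; the two 6-tuples are worked out geometrically for a 595 x 842 page and are only assumed to correspond to what rotation_matrix and derotation_matrix would deliver for a rotation of 90 degrees. A point (x, y) is mapped to (a*x + c*y + e, b*x + d*y + f).

```python
def apply(matrix, point):
    """Transform a point with a matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


A4_WIDTH, A4_HEIGHT = 595, 842
rot90 = (0, 1, -1, 0, A4_HEIGHT, 0)    # unrotated -> rotated coordinates
derot90 = (0, -1, 1, 0, 0, A4_HEIGHT)  # rotated -> unrotated coordinates

top_left = apply(rot90, (0, 0))
back = apply(derot90, top_left)
print(top_left, back)  # (842, 0) (0, 0)
```

The top-left corner of the unrotated page therefore ends up at the top-right corner of the rotated page, which is 842 wide and 595 high.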
first_link
Contains the first Link of a page (or None).
Type Link
first_annot
Contains the first Annot of a page (or None).
Type Annot
first_widget
Contains the first Widget of a page (or None).
Type Widget
number
The page number.
Type int
parent
The owning document object.
Type Document
rect
Contains the rectangle of the page. Same as result of Page.bound().
Type Rect
xref
The page’s PDF xref. Zero if not a PDF.
Type int
Each entry of the Page.get_links() list is a dictionary with the following keys:
• kind: (required) an integer indicating the kind of link. This is one of LINK_NONE, LINK_GOTO,
LINK_GOTOR, LINK_LAUNCH, or LINK_URI. For values and meaning of these names refer to Link Desti-
nation Kinds.
• from: (required) a Rect describing the “hot spot” location on the page’s visible representation (where the cursor
changes to a hand image, usually).
• page: a 0-based integer indicating the destination page. Required for LINK_GOTO and LINK_GOTOR, else
ignored.
• to: either a fitz.Point, specifying the destination location on the provided page, default is fitz.Point(0, 0), or a
symbolic (indirect) name. If an indirect name is specified, page = -1 is required and the name must be defined
in the PDF in order for this to work. Required for LINK_GOTO and LINK_GOTOR, else ignored.
• file: a string specifying the destination file. Required for LINK_GOTOR and LINK_LAUNCH, else ignored.
• uri: a string specifying the destination internet resource. Required for LINK_URI, else ignored. You should
make sure to start this string with an unambiguous substring, that classifies the subtype of the URL, like
"http://", "https://", "file://", "ftp://", "mailto:", etc. Otherwise your browser will try
to interpret the text and come to unwanted / unexpected conclusions about the intended URL type.
• xref : an integer specifying the PDF xref of the link object. Do not change this entry in any way. Required for
link deletion and update, otherwise ignored. For non-PDF documents, this entry contains -1. It is also -1 for all
entries in the get_links() list, if any of the links is not supported by MuPDF - see the note below.
MuPDF’s support for links has changed in v1.10a. These changes affect link types LINK_GOTO and LINK_GOTOR.
6.12.3.1 Reading (pertains to method get_links() and the first_link property chain)
If MuPDF detects a link to another file, it will supply either a LINK_GOTOR or a LINK_LAUNCH link kind. In case
of LINK_GOTOR destination details may either be given as page number (possibly including position information),
or as an indirect destination.
If an indirect destination is given, then this is indicated by page = -1, and link.dest.dest will contain this name. The
dictionaries in the get_links() list will contain this information as the to value.
Internal links are always of kind LINK_GOTO. If an internal link specifies an indirect destination, it will always
be resolved and the resulting direct destination will be returned. Names are never returned for internal links, and
undefined destinations will cause the link to be ignored.
6.12.3.2 Writing
PyMuPDF writes (updates, inserts) links by constructing and writing the appropriate PDF object source. This makes
it possible to specify indirect destinations for LINK_GOTOR and LINK_GOTO link kinds (pre PDF 1.2 file formats
are not supported).
Warning: If a LINK_GOTO indirect destination specifies an undefined name, this link can later on not be found /
read again with MuPDF / PyMuPDF. Other readers however will detect it, but flag it as erroneous.
Indirect LINK_GOTOR destinations can in general of course not be checked for validity and are therefore always
accepted.
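Before writing a link, it may help to check that the dictionary carries the keys required for its kind. The sketch below restates the “required” rules from the key list above in plain Python. It is keyed by the constant names as strings, because it does not import fitz; the table REQUIRED and the helper missing_keys are hypothetical. In real code the dictionary would also contain "kind" set to the matching fitz.LINK_* constant.

```python
# Keys required per link kind, as described in the key list above
REQUIRED = {
    "LINK_NONE": ("from",),
    "LINK_GOTO": ("from", "page", "to"),
    "LINK_GOTOR": ("from", "page", "to", "file"),
    "LINK_LAUNCH": ("from", "file"),
    "LINK_URI": ("from", "uri"),
}


def missing_keys(kind_name, link):
    """Return the required keys that the link dictionary does not contain."""
    return [k for k in REQUIRED[kind_name] if k not in link]


uri_link = {"from": (72, 72, 200, 90), "uri": "https://example.com"}
goto_link = {"from": (72, 100, 200, 118), "page": 3}

print(missing_keys("LINK_URI", uri_link))    # []
print(missing_keys("LINK_GOTO", goto_link))  # ['to']
```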
This is an overview of homologous methods on the Document and on the Page level.
The page number “pno” is a 0-based integer -∞ < pno < page_count.
Note: Most document methods (left column) exist for convenience reasons, and are just wrappers for
Document[pno].<page method>. So they load and discard the page on each execution.
However, the first two methods work differently. They only need a page’s object definition statement - the page
itself will not be loaded. So e.g. Page.get_fonts() is a wrapper the other way round and defined as follows:
page.get_fonts() == page.parent.get_page_fonts(page.number).
6.13 Pixmap
Pixmaps (“pixel maps”) are objects at the heart of MuPDF’s rendering capabilities. They represent plane rectangular
sets of pixels. Each pixel is described by a number of bytes (“components”) defining its color, plus an optional alpha
byte defining its transparency.
In PyMuPDF, there exist several ways to create a pixmap. Except for the first one, all of them are available as overloaded
constructors. A pixmap can be created . . .
1. from a document page (method Page.get_pixmap())
2. empty, based on Colorspace and IRect information
3. from a file
4. from an in-memory image
5. from a memory area of plain pixels
6. from an image inside a PDF document
7. as a copy of another pixmap
Note: A number of image formats is supported as input for points 3. and 4. above. See section Supported Input
Image Formats.
Have a look at the Collection of Recipes section to see some pixmap usage “at work”.
Class API
class Pixmap
__init__(self, source, mask)
Copy and add image mask: Copy source pixmap, add an alpha channel with transparency data from a
mask pixmap.
Parameters
• source (Pixmap) – pixmap without alpha channel.
• mask (Pixmap) – a mask pixmap. Must be a grayscale pixmap.
__init__(self, source, width, height[, clip ])
Copy and scale: Copy source pixmap, scaling it to new width and height values – the image will appear
stretched or shrunk accordingly. Supports partial copying. The source colorspace may be None.
Parameters
• source (Pixmap) – the source pixmap.
• width (float) – desired target width.
• height (float) – desired target height.
• clip (irect_like) – restrict the resulting pixmap to this region of the scaled
pixmap.
Note: If width or height do not represent integers (i.e. value.is_integer() != True), then the
resulting pixmap will have an alpha channel.
Note: A typical use is the separation of color and transparency bytes into separate pixmaps. Some
applications require this, e.g. wx.Bitmap.FromBufferAndAlpha() of wxPython.
__init__(self, filename)
From a file: Create a pixmap from filename. All properties are inferred from the input. The origin of the
resulting pixmap is (0, 0).
Parameters filename (str) – Path of the image file.
__init__(self, stream)
From memory: Create a pixmap from a memory area. All properties are inferred from the input. The
origin of the resulting pixmap is (0, 0).
Parameters stream (bytes,bytearray,BytesIO) – Data containing a complete,
valid image. Could have been created by e.g. stream = bytearray(open("image.file",
"rb").read()). Type bytes is supported in Python 3 only, because bytes == str in Python
2 and the method will interpret the stream as a filename.
Changed in version 1.14.13: io.BytesIO is now also supported.
__init__(self, colorspace, width, height, samples, alpha)
From plain pixels: Create a pixmap from samples. Each pixel must be represented by a number of bytes
as controlled by the colorspace and alpha parameters. The origin of the resulting pixmap is (0, 0). This
method is useful when raw image data are provided by some other program – see Collection of Recipes.
Parameters
• colorspace (Colorspace) – Colorspace of image.
• width (int) – image width
• height (int) – image height
• samples (bytes,bytearray,BytesIO) – an area containing all pixels of
the image. Must include alpha values if specified.
Changed in version 1.14.13: (1) io.BytesIO can now also be used. (2) Data are now
copied to the pixmap, so may safely be deleted or become unavailable.
• alpha (bool) – whether a transparency channel is included.
Note:
1. The following equation must be true: (colorspace.n + alpha) * width * height == len(samples).
2. Starting with version 1.14.13, the samples data are copied to the pixmap.
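The size rule in note 1 can be checked before calling the constructor. The following is a minimal sketch in plain Python; the helper name expected_samples_length is hypothetical, and colorspace_n stands for the value of colorspace.n (for example 3 for RGB).

```python
def expected_samples_length(colorspace_n, alpha, width, height):
    """Number of bytes the samples area must have for the given geometry."""
    # alpha contributes one extra byte per pixel when it is True
    return (colorspace_n + int(alpha)) * width * height


# RGB (3 components) with alpha, 4 x 2 pixels, every pixel opaque red
width, height = 4, 2
samples = bytes([255, 0, 0, 255]) * (width * height)

# True only if the buffer matches the equation from the note
samples_ok = len(samples) == expected_samples_length(3, True, width, height)
```

Running such a check first gives a clearer error message than passing a buffer of the wrong size.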
Parameters
• red (int) – red component.
• green (int) – green component.
• blue (int) – blue component.
gamma_with(gamma)
Apply a gamma factor to a pixmap, i.e. lighten or darken it. Pixmaps with colorspace None are ignored
with a warning.
Parameters gamma (float) – gamma = 1.0 does nothing, gamma < 1.0 lightens, gamma
> 1.0 darkens the image.
shrink(n)
Shrink the pixmap by dividing both its width and its height by 2^n.
Parameters n (int) – determines the new pixmap (samples) size. For example, a value of
2 divides width and height by 4 and thus results in a size of one 16th of the original.
Values less than 1 are ignored with a warning.
Note: Use this method to reduce a pixmap’s size while retaining its proportions. The pixmap is changed “in
place”. If you want to keep the original and also have more granular choices, use the respective copy constructor
above.
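The arithmetic behind shrink(n) can be illustrated in plain Python. This is a sketch only: the function name shrunk_size is hypothetical, and rounding up for dimensions that are not divisible by 2^n is an assumption that should be verified against the library.

```python
import math


def shrunk_size(width, height, n):
    """Estimate the (width, height) of a pixmap after shrink(n)."""
    if n < 1:
        # values less than 1 leave the pixmap unchanged
        return width, height
    factor = 2 ** n
    # assumption: partial blocks at the border still produce one pixel
    return math.ceil(width / factor), math.ceil(height / factor)


new_w, new_h = shrunk_size(1000, 800, 2)   # width and height divided by 4
size_ratio = (1000 * 800) // (new_w * new_h)  # one 16th of the pixels remain
```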
pixel(x, y)
New in version 1.14.5: Return the value of the pixel at location (x, y) (column, line).
Parameters
• x (int) – the column number of the pixel. Must be in range(pix.width).
• y (int) – the line number of the pixel. Must be in range(pix.height).
Return type list
Returns a list of color values and, potentially the alpha value. Its length and content depend
on the pixmap’s colorspace and the presence of an alpha. For RGBA pixmaps the result
would e.g. be [r, g, b, a]. All items are integers in range(256).
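How such a pixel value relates to the samples area can be sketched in plain Python, using the layout documented for Pixmap.samples and Pixmap.stride (scanlines without padding, n bytes per pixel). The helper pixel_at is hypothetical and only mirrors what the method returns.

```python
def pixel_at(samples, width, n, x, y):
    """Return the n component values of pixel (x, y) as a list of ints."""
    stride = width * n            # bytes per scanline, no padding
    offset = y * stride + x * n   # first byte of the requested pixel
    return list(samples[offset:offset + n])


# 2 x 2 RGBA image: red, green in the first line,
# blue, half-transparent white in the second line
samples = bytes([255, 0, 0, 255,   0, 255, 0, 255,
                 0, 0, 255, 255,   255, 255, 255, 128])
```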
set_pixel(x, y, color)
New in version 1.14.7: Manipulate the pixel at location (x, y) (column, line).
Parameters
• x (int) – the column number of the pixel. Must be in range(pix.width).
• y (int) – the line number of the pixel. Must be in range(pix.height).
• color (sequence) – the desired pixel value given as a sequence of integers in
range(256). The length of the sequence must equal Pixmap.n, which includes
any alpha byte.
set_rect(irect, color)
New in version 1.14.8: Set the pixels of a rectangle to a value.
Parameters
• irect (irect_like) – the rectangle to be filled with the value. The actual
area is the intersection of this parameter and Pixmap.irect. For an empty
intersection (or an invalid parameter), no change will happen.
Note:
1. This method is equivalent to Pixmap.set_pixel() executed for each pixel in the rectangle,
but is obviously very much faster if many pixels are involved.
2. This method can be used similarly to Pixmap.clear_with() to initialize a pixmap with a certain
color like this: pix.set_rect(pix.irect, (255, 255, 0)) (RGB example, colors the complete pixmap
yellow).
set_origin(x, y)
(New in v1.17.7) Set the x and y values of the pixmap’s top-left point.
Parameters
• x (int) – x coordinate
• y (int) – y coordinate
set_dpi(xres, yres)
(New in v1.16.17) Set the resolution (dpi) in x and y direction.
(Changed in v1.18.0) When saving as a PNG image, these values will be stored now.
Parameters
• xres (int) – resolution in x direction.
• yres (int) – resolution in y direction.
set_alpha(alphavalues, premultiply=1, opaque=None)
(Changed in v 1.18.13)
Change the alpha values. The pixmap must have an alpha channel.
Parameters
• alphavalues (bytes,bytearray,BytesIO) – the new alpha values. If
provided, its length must be at least width * height. If omitted (None), all alpha
values are set to 255 (no transparency). Changed in version 1.14.13: io.BytesIO is
now also accepted.
• premultiply (bool) – New in v1.18.13: whether to premultiply color components
with the alpha value.
• opaque (list,tuple) – ignore the alpha value and set this color to fully transparent.
A sequence of integers in range(256) with a length of Pixmap.n. Default
is None. For example, a typical choice for RGB would be opaque=(255,
255, 255) (white).
invert_irect([irect ])
Invert the color of all pixels in IRect irect. Will have no effect if colorspace is None.
Parameters irect (irect_like) – The area to be inverted. Omit to invert everything.
copy(source, irect)
Copy the irect part of the source pixmap into the corresponding area of this one. The two pixmaps may
have different dimensions and can each have CS_GRAY or CS_RGB colorspaces, but they currently must
have the same alpha property [2]. The copy mechanism automatically adjusts discrepancies between source
and target like so:
If copying from CS_GRAY to CS_RGB, the source gray-shade value will be put into each of the three rgb
component bytes. If the other way round, (r + g + b) / 3 will be taken as the gray-shade value of the
target.
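The colorspace adjustment just described can be sketched in plain Python. This only illustrates the stated rule and is not MuPDF's code: the function names are hypothetical, and truncating integer division is an assumption, since the text does not say how (r + g + b) / 3 is rounded.

```python
def gray_to_rgb(gray):
    """CS_GRAY to CS_RGB: repeat the gray shade in all three components."""
    return (gray, gray, gray)


def rgb_to_gray(r, g, b):
    """CS_RGB to CS_GRAY: average of the three components."""
    # assumption: the fractional part is dropped
    return (r + g + b) // 3
```

Note that a plain average treats the three channels equally, so pure red, green and blue all map to the same gray shade.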
Between irect and the target pixmap’s rectangle, an “intersection” is calculated at first. This takes into
account the rectangle coordinates and the current attribute values Pixmap.x and Pixmap.y (which
you are free to modify for this purpose via Pixmap.set_origin()). Then the corresponding data of
this intersection are copied. If the intersection is empty, nothing will happen.
Parameters
• source (Pixmap) – source pixmap.
• irect (irect_like) – The area to be copied.
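Note: Example: Suppose you have two pixmaps, pix1 and pix2, and you want to copy the lower right
quarter of pix2 to pix1 such that it starts at the top-left point of pix1. To do this, use Pixmap.set_origin() to shift the
origin of pix2 so that its lower right quarter coincides with the top-left area of pix1, then copy that area.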
save(filename, output=None)
Save pixmap as an image file. Depending on the output chosen, only some or all colorspaces are supported
and different file extensions can be chosen. Please see the table below. Since MuPDF v1.10a the savealpha
option is no longer supported and will be silently ignored.
Parameters
• filename (str,Path,file) – The file to save to. May be provided as a
string, as a pathlib.Path or as a Python file object. In the latter two cases, the
filename is taken from the resp. object. The filename’s extension determines the
image format, which can be overruled by the output parameter.
• output (str) – The requested image format. The default is the filename’s extension.
If not recognized, png is assumed. For other possible values see Supported
Output Image Formats.
[2] To also set the alpha property, add an additional step to this method by dropping or adding an alpha channel to the result.
pdfocr_save(filename, compress=True, language="eng")
• New in v1.19.0
Perform text recognition using Tesseract and save the image as a 1-page PDF with an OCR text layer.
Parameters
• filename (str,fp) – identifies the file to save to. May be either a string or a
pointer to a file opened with “wb” (includes io.BytesIO() objects).
• compress (bool) – whether to compress the resulting PDF, default is True.
• language (str) – the languages occurring in the image. This must be specified
in Tesseract format. Default is “eng” for English. Use “+”-separated Tesseract
language codes for multiple languages, like “eng+spa” for English and Spanish.
Note: Will fail if Tesseract is not installed or if the environment variable “TESSDATA_PREFIX” is not
set to the tessdata folder name. This is what you would typically see on a Windows platform:
>>> print(os.environ["TESSDATA_PREFIX"])
C:\Program Files\Tesseract-OCR\tessdata
And this is a typical value on a Linux platform:
>>> print(os.environ["TESSDATA_PREFIX"])
/usr/share/tesseract-ocr/4.00/tessdata
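Because the OCR methods fail without this variable, a script may want to test it up front. The following is a minimal sketch using only the standard library; the helper name tessdata_folder is hypothetical. It checks the environment variable only and does not prove that Tesseract itself is installed or that the folder contains language data.

```python
import os


def tessdata_folder(environ=os.environ):
    """Return the configured tessdata folder, or None if it is not usable."""
    folder = environ.get("TESSDATA_PREFIX")
    # the variable must be set and must point to an existing directory
    if folder and os.path.isdir(folder):
        return folder
    return None
```

A caller could skip the OCR step, or report a configuration hint, whenever the function returns None.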
pdfocr_tobytes(compress=True, language="eng")
• New in v1.19.0
Perform text recognition using Tesseract and convert the image to a 1-page PDF with an OCR text layer.
Internally invokes Pixmap.pdfocr_save().
Returns
A 1-page PDF file in memory. Could be opened like doc=fitz.open("pdf",
pix.pdfocr_tobytes()), and text extractions could be performed on its
page=doc[0].
Note: Another possible use is insertion into some pdf. The following snippet reads
the images of a folder and stores them as pages in a new PDF that contain an OCR text
layer:
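import os
import fitz

doc = fitz.open()  # new, empty PDF
for imgfile in os.listdir(folder):
    # os.listdir() returns bare file names, so prepend the folder
    pix = fitz.Pixmap(os.path.join(folder, imgfile))
    imgpdf = fitz.open("pdf", pix.pdfocr_tobytes())  # 1-page PDF with OCR layer
    doc.insert_pdf(imgpdf)  # append it to the output PDF
    pix = None  # release the pixmap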
tobytes(output="png")
New in version 1.14.5: Return the pixmap as a bytes memory object of the specified format – similar to
save().
Parameters output (str) – The requested image format. The default is “png”. For other
possible values see Supported Output Image Formats.
Return type bytes
pil_save(*args, **kwargs)
(New in v1.17.3)
Write the pixmap as an image file using Pillow. Use this method for output unsupported by MuPDF.
Examples are
• Formats JPEG, JPX, J2K, WebP, etc.
• Storing EXIF information.
• If you do not provide dpi information, the values xres, yres stored with the pixmap are automatically
used.
A simple example: pix.pil_save("some.jpg", optimize=True, dpi=(150, 150)).
For details on other parameters see the Pillow documentation.
Note: (Changed in v1.18.0) Pixmap.save() now also sets dpi from xres / yres automatically, when
saving a PNG image.
If Pillow is not installed an ImportError exception is raised.
pil_tobytes(*args, **kwargs)
(New in v1.17.3)
Return an image as a bytes object in the specified format using Pillow. For example:
stream = pix.pil_tobytes(format="JPEG", optimize=True). Also see above. For details on other
parameters see the Pillow documentation. If Pillow is not installed, an ImportError exception is raised.
Return type bytes
warp(quad, width, height)
• New in v1.19.3
Return a new pixmap by “warping” the quad such that the quad corners become the new pixmap’s corners.
The target pixmap’s irect will be (0, 0, width, height).
Parameters
• quad (quad_like) – a convex quad with coordinates inside Pixmap.irect
(including the border points).
• width (int) – desired resulting width.
• height (int) – desired resulting height.
Returns A new pixmap where the quad corners are mapped to the pixmap corners in a
clockwise fashion: quad.ul -> irect.tl, quad.ur -> irect.tr, etc.
Return type
Pixmap
color_count(colors=False, clip=None)
• New in v1.19.2
• Changed in v1.19.3
Determine the pixmap’s unique colors and their count.
Parameters
• colors (bool) – (changed in v1.19.3) If True return a dictionary of color pixels
and their usage count, else just the number of unique colors.
• clip (rect_like) – a rectangle inside Pixmap.irect. If provided, only
those pixels are considered. This allows inspecting sub-rectangles of a given
pixmap directly – instead of building sub-pixmaps.
Return type dict or int
Returns
either the number of colors, or a dictionary with the items pixel: count. The
pixel key is a bytes object of length Pixmap.n.
>>> pix=fitz.Pixmap("sierpinski-carpet.png")
>>> colors = pix.color_count(True)
>>> print(colors)
{b'\xff\xef\xd5': 262144, b'\x00\x00\xff': 269297}
>>> [tuple(map(int, c)) for c in colors.keys()]
[(255, 239, 213), (0, 0, 255)]
• The response time depends on the pixmap’s samples size and may be more than a
second for very large pixmaps.
• Where applicable, pixels with different alpha values will be treated as different
colors.
color_topusage(clip=None)
• New in v1.19.3
Return the most frequently used color and its relative frequency.
Parameters clip (rect_like) – a rectangle inside Pixmap.irect. If provided, only
those pixels are considered. This allows inspecting sub-rectangles of a given pixmap
directly – instead of building sub-pixmaps.
Return type tuple[float, bytes]
Returns A tuple (ratio, pixel) where 0 < ratio <= 1 and pixel is the pixel
value of the color. Use this to decide if the image is “almost” unicolor: e.g. a response
(0.95, b"\x00\x00\x00") means that 95% of all pixels are black.
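What color_count() and color_topusage() report can be modelled in plain Python on a raw samples buffer. This is a sketch for illustration, not the library's implementation; the function names are hypothetical and n stands for Pixmap.n.

```python
from collections import Counter


def count_colors(samples, n):
    """Map each n-byte pixel value to its number of occurrences."""
    return Counter(samples[i:i + n] for i in range(0, len(samples), n))


def top_usage(samples, n):
    """Return (ratio, pixel) for the most frequently used pixel value."""
    counts = count_colors(samples, n)
    pixel, count = counts.most_common(1)[0]
    return count / sum(counts.values()), pixel


# three black pixels and one white pixel, RGB without alpha
samples = b"\x00\x00\x00" * 3 + b"\xff\xff\xff"
```

With an alpha channel, n includes the alpha byte, so pixels that differ only in transparency are counted as different colors, as stated above.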
alpha
Indicates whether the pixmap contains transparency information.
Type bool
digest
The MD5 hashcode (16 bytes) of the pixmap. This is a technical value used for unique identifications.
Type bytes
colorspace
The colorspace of the pixmap. This value may be None if the image is to be treated as a so-called image
mask or stencil mask (currently happens for extracted PDF document images only).
Type Colorspace
stride
Contains the length of one row of image data in Pixmap.samples. This is primarily used for calculation
purposes. The following expressions are true:
• len(samples) == height * stride
• width * n == stride
Type int
is_monochrome
• New in v1.19.2
Is True for a gray pixmap which only has the colors black and white.
Type bool
is_unicolor
• New in v1.19.2
Is True if all pixels are identical (any colorspace). Where applicable, pixels with different alpha values
will be treated as different colors.
Type bool
irect
Contains the IRect of the pixmap.
Type IRect
samples
The color and (if Pixmap.alpha is true) transparency values for all pixels. It is an area of width
* height * n bytes. Each n bytes define one pixel. Each successive n bytes yield another pixel in
scanline order. Subsequent scanlines follow each other with no padding. E.g. for an RGBA colorspace
this means, samples is a sequence of bytes like . . . , R, G, B, A, . . . , and the four byte values R, G, B, A
define one pixel.
This area can be passed to other graphics libraries like PIL (Python Imaging Library) to do additional
processing like saving the pixmap in other image formats.
Note:
• The underlying data is typically a large memory area, from which a bytes copy is made for this
attribute each time you access it: for example, an RGB-rendered letter page has a samples size
of almost 1.4 MB. So consider assigning it to a variable once, or use the memoryview version
Pixmap.samples_mv (new in v1.18.17).
• Any changes to the underlying data are available only after accessing this attribute again. This is
different from using the memoryview version.
Type bytes
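The interleaved layout described above makes it straightforward to separate color and transparency bytes, as some GUI toolkits require. The following plain-Python sketch assumes that the alpha byte is the last of the n bytes of each pixel, as in the R, G, B, A example; the helper name split_alpha is hypothetical.

```python
def split_alpha(samples, n):
    """Split interleaved pixel data into (color bytes, alpha bytes)."""
    color = bytearray()
    alpha = bytearray()
    for i in range(0, len(samples), n):
        color += samples[i:i + n - 1]      # all components except the last
        alpha.append(samples[i + n - 1])   # the alpha byte of this pixel
    return bytes(color), bytes(alpha)


# two RGBA pixels: opaque red, half-transparent green
rgba = bytes([255, 0, 0, 255, 0, 255, 0, 128])
rgb, alpha = split_alpha(rgba, 4)
```

For large pixmaps, a loop like this is slow in pure Python; it is meant to show the layout, not to be a fast implementation.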
samples_mv
(New in v1.18.17)
Like Pixmap.samples, but in Python memoryview format. It is built pointing to the memory in
the pixmap – not from a copy of it. So its creation speed is independent from the pixmap size, and any
changes to pixels will be available immediately.
Copies like bytearray(pix.samples_mv) or bytes(pix.samples_mv) are equivalent
to and can be used in place of pix.samples.
We also have len(pix.samples) == len(pix.samples_mv).
In an example with a pixmap made from a 2 MB JPEG, access via the memoryview was about ten thousand times faster.
Type memoryview
samples_ptr
(New in v1.18.17)
Python pointer to the pixel area. This is a special integer format, which can be used by supporting
applications (such as PyQt) to directly address the samples area and thus build their images extremely
fast. For example, a Qt image can be built either (1) from pix.samples or (2) from pix.samples_ptr.
Both lead to the same Qt image, but (2) can be many hundred times faster, because it
avoids an additional copy of the pixel area.
Type int
size
Contains len(pixmap). This will generally equal len(pix.samples) plus some platform-specific value for
defining other attributes of the object.
Type int
width
w
Width of the region in pixels.
Type int
height
h
Height of the region in pixels.
Type int
x
X-coordinate of top-left corner in pixels. Cannot directly be changed – use Pixmap.set_origin().
Type int
y
Y-coordinate of top-left corner in pixels. Cannot directly be changed – use Pixmap.set_origin().
Type int
n
Number of components per pixel. This number depends on colorspace and alpha. If colorspace is not
None, then Pixmap.n - Pixmap.alpha == Pixmap.colorspace.n is true. If colorspace is
None (stencil masks), then n == alpha == 1.
Type int
xres
Horizontal resolution in dpi (dots per inch). Please also see resolution. Cannot directly be changed
– use Pixmap.set_dpi().
Type int
yres
Vertical resolution in dpi (dots per inch). Please also see resolution. Cannot directly be changed –
use Pixmap.set_dpi().
Type int
interpolate
An information-only boolean flag set to True if the image will be drawn using “linear interpolation”. If
False “nearest neighbour sampling” will be used.
Type bool
The following file types are supported as input to construct pixmaps: BMP, JPEG, GIF, TIFF, JXR, JPX, PNG,
PAM and all of the Portable Anymap family (PBM, PGM, PNM, PPM). This support is two-fold:
1. Directly create a pixmap with Pixmap(filename) or Pixmap(bytearray). The pixmap will then have properties as
determined by the image.
2. Open such files with fitz.open(. . . ). The result will then appear as a document containing one single page.
Creating a pixmap of this page offers all the options available in this context: apply a matrix, choose colorspace
and alpha, confine the pixmap to a clip area, etc.
SVG images are only supported via method 2 above, not directly as pixmaps. But remember: the result of this is a
raster image, as is always the case with pixmaps [1].
A number of image output formats are supported. You have the option to either write an image directly to a file
(Pixmap.save()), or to generate a bytes object (Pixmap.tobytes()). Both methods accept a 3-letter string
identifying the desired format (Format column below). Please note that not all combinations of pixmap colorspace,
transparency support (alpha) and image format are possible.
Note:
• Not all image file types are supported (or at least common) on all OS platforms. E.g. PAM and the Portable
Anymap formats are rare or even unknown on Windows.
• Especially pertaining to CMYK colorspaces, you can always convert a CMYK pixmap to an RGB pixmap with
rgb_pix = fitz.Pixmap(fitz.csRGB, cmyk_pix) and then save that in the desired format.
• As can be seen, MuPDF’s image support range is different for input and output. Among those supported both
ways, PNG is probably the most popular. We recommend using Pillow whenever you face a support gap.
• We also recommend using “ppm” formats as input to tkinter’s PhotoImage method like this:
tkimg = tkinter.PhotoImage(data=pix.tobytes("ppm")) (also see the tutorial). This is very fast (60 times faster than PNG)
and will work under Python 2 or 3.
6.14 Point
[1] ... enough, look for other SVG-to-PDF conversion tools like the Python packages svglib, CairoSVG, Uniconvertor or the Java solution Apache Batik.
Have a look at our Wiki for more examples.
Class API
class Point
__init__(self )
__init__(self, x, y)
__init__(self, point)
__init__(self, sequence)
Overloaded constructors.
Without parameters, Point(0, 0) will be created.
With another point specified, a new copy will be created. “sequence” is a Python sequence of
2 numbers (see Using Python Sequences as Arguments in PyMuPDF).
Parameters
• x (float) – x coordinate of the point
• y (float) – y coordinate of the point
distance_to(x[, unit ])
Calculate the distance to x, which may be point_like or rect_like. The distance is
given in units of either pixels (default), inches, centimeters or millimeters.
Parameters
• x (point_like,rect_like) – to which to compute the distance.
• unit (str) – the unit to be measured in. One of “px”, “in”, “cm”, “mm”.
Return type float
Returns
the distance to x. If this is rect_like, then the distance
• is the length of the shortest line connecting to one of the rectangle sides
• is calculated to the finite version of it
• is zero if it contains the point
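The rectangle case can be sketched in plain Python. The sketch rests on two assumptions that are not stated in the text: one pixel is treated as 1/72 inch, and points on the rectangle border count as contained. The function name distance_to_rect is hypothetical.

```python
import math

# assumption: 1 px corresponds to 1/72 inch
UNITS_PER_PIXEL = {"px": 1.0, "in": 1 / 72, "cm": 2.54 / 72, "mm": 25.4 / 72}


def distance_to_rect(px, py, rect, unit="px"):
    """Shortest distance from point (px, py) to a finite rectangle."""
    x0, y0, x1, y1 = rect
    dx = max(x0 - px, 0.0, px - x1)   # horizontal gap, 0 if inside the x range
    dy = max(y0 - py, 0.0, py - y1)   # vertical gap, 0 if inside the y range
    return math.hypot(dx, dy) * UNITS_PER_PIXEL[unit]
```

If both gaps are zero the point lies inside the rectangle and the distance is zero, which matches the last bullet above.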
norm()
(New in version 1.16.0)
Return the Euclidean norm (the length) of the point as a vector. Equals result of function abs().
transform(m)
Apply a matrix to the point and replace it with the result.
unit
Result of dividing each coordinate by norm(point), the distance of the point to (0,0). This is a vector of
length 1 pointing in the same direction as the point does. Its x, resp. y values are equal to the cosine, resp.
sine of the angle this vector (and the point itself) has with the x axis.
Type Point
abs_unit
Same as unit above, replacing the coordinates with their absolute values.
Type Point
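The unit and abs_unit properties can be illustrated with a few lines of plain Python. The function names are hypothetical, and returning (0, 0) for the zero point is an assumption, since the text does not cover that case.

```python
import math


def unit_vector(x, y):
    """Divide both coordinates by the Euclidean norm of (x, y)."""
    norm = math.hypot(x, y)
    if norm == 0:
        return (0.0, 0.0)  # assumption: the zero point has no direction
    return (x / norm, y / norm)


def abs_unit_vector(x, y):
    """Same as unit_vector, with absolute coordinate values."""
    ux, uy = unit_vector(x, y)
    return (abs(ux), abs(uy))


ux, uy = unit_vector(3, 4)      # cosine and sine of the angle with the x axis
angle = math.atan2(uy, ux)      # same angle as that of the original point
```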
x
The x coordinate
Type float
y
The y coordinate
Type float
Note:
• This class adheres to the Python sequence protocol, so components can be accessed via their index, too. Also
refer to Using Python Sequences as Arguments in PyMuPDF.
• Points can be used with arithmetic operators – see chapter Operator Algebra for Geometry Objects.
6.15 Quad
Represents a four-sided mathematical shape (also called “quadrilateral” or “tetragon”) in the plane, defined as a
sequence of four Point objects ul, ur, ll, lr (conveniently called upper left, upper right, lower left, lower right).
Quads can be obtained as results of text search methods (Page.search_for()), and they are used to define text
marker annotations (see e.g. Page.add_squiggly_annot() and friends), and in several draw methods (like
Page.draw_quad() / Shape.draw_quad(), Page.draw_oval() / Shape.draw_oval()).
Note:
• If the corners of a rectangle are transformed with a rotation, scale or translation Matrix, then the resulting quad
is rectangular (= congruent to a rectangle), i.e. all of its corners again enclose angles of 90 degrees. Property
Quad.is_rectangular checks whether a quad can be thought of being the result of such an operation.
• This is not true for all matrices: e.g. shear matrices produce parallelograms, and non-invertible matrices deliver
“degenerate” tetragons like triangles or lines.
• Attribute Quad.rect obtains the enveloping rectangle. Vice versa, rectangles now have attributes
Rect.quad, resp. IRect.quad to obtain their respective tetragon versions.
Class API
class Quad
__init__(self )
__init__(self, ul, ur, ll, lr)
__init__(self, quad)
__init__(self, sequence)
Overloaded constructors: “ul”, “ur”, “ll”, “lr” stand for point_like objects (the four corners),
“sequence” is a Python sequence with four point_like objects.
If “quad” is specified, the constructor creates a new copy of it.
Without parameters, a quad consisting of 4 copies of Point(0, 0) is created.
transform(matrix)
Modify the quadrilateral by transforming each of its corners with a matrix.
Parameters matrix (matrix_like) – the matrix.
morph(fixpoint, matrix)
(New in version 1.17.0) “Morph” the quad with a matrix-like using a point-like as fixed point.
Parameters
• fixpoint (point_like) – the point.
• matrix (matrix_like) – the matrix.
Returns a new quad (no operation if this is the infinite quad).
rect
The smallest rectangle containing the quad.
Type Rect
ul
Upper left point.
Type Point
ur
Upper right point.
Type Point
ll
Lower left point.
Type Point
lr
Lower right point.
Type Point
is_convex
(New in version 1.16.1)
Checks if for any two points of the quad, all points on their connecting line also belong to the quad.
Type bool
is_empty
True if enclosed area is zero, which means that at least three of the four corners are on the same line. If
this is false, the quad may still be degenerate or not look like a tetragon at all (triangles, parallelograms,
trapezoids, . . . ).
Type bool
is_rectangular
True if all corner angles are 90 degrees. This implies that the quad is convex and not empty.
Type bool
width
The maximum length of the top and the bottom side.
Type float
height
The maximum length of the left and the right side.
Type float
6.15.1 Remark
This class adheres to the sequence protocol, so components can be dealt with via their indices, too. Also refer to Using
Python Sequences as Arguments in PyMuPDF.
We are still in the process of extending algebraic operations to quads. Multiplication and division with / by numbers and
matrices are already defined. Addition, subtraction and any unary operations may follow when we see an actual need.
Independent from the previous remark, the following containment checks are possible:
• point in quad – check whether a point is inside a quadrilateral.
• rect in quad – check whether a rectangle is inside a quadrilateral. This is done by checking the containment
of its four corners.
• quad in quad – check whether some quad is contained in some other quadrilateral. This is done by checking
the containment of its four corners.
Please note the following interesting detail:
Of a rectangle’s four corners, only the top-left one belongs to it. Since v1.19.0, rectangles are defined to be “open”, such that the
bottom and the right edge do not belong to them – including the respective corners. But for quads there exists no notion
like “openness”, so we have the following surprising situation: a point such as the bottom-right corner is not contained
in the rectangle, but it is contained in the rectangle’s quad.
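The difference between the two containment notions can be modelled in plain Python. This is a simplified sketch: the closed test below stands in for the quad of an axis-parallel rectangle and is not the library's general point-in-quad algorithm; both function names are hypothetical.

```python
def point_in_rect(p, rect):
    """Half-open test: the right and bottom edges are excluded."""
    x, y = p
    x0, y0, x1, y1 = rect
    return x0 <= x < x1 and y0 <= y < y1


def point_in_closed_box(p, rect):
    """Closed test: all four edges and corners are included."""
    x, y = p
    x0, y0, x1, y1 = rect
    return x0 <= x <= x1 and y0 <= y <= y1


rect = (100, 100, 200, 200)
top_left, bottom_right = (100, 100), (200, 200)
```

With these definitions the top-left corner passes both tests, while the bottom-right corner passes only the closed one.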
6.16 Rect
Rect represents a rectangle defined by four floating point numbers x0, y0, x1, y1. They are treated as being coordinates
of two diagonally opposite points. The first two numbers are regarded as the “top left” corner P(x0, y0) and the last two, P(x1, y1), as the
“bottom right” one. However, these two properties need not coincide with their intuitive meanings – read on.
For any given quadruple of numbers, the geometrically same rectangle can be written in four different ways:
1. Rect(P(x0, y0), P(x1, y1))
2. Rect(P(x1, y1), P(x0, y0))
3. Rect(P(x0, y1), P(x1, y0))
4. Rect(P(x1, y0), P(x0, y1))
(Changed in v1.19.0) Hence some classification:
• A rectangle is called valid if x0 <= x1 and y0 <= y1 (i.e. the bottom right point is “south-eastern” to the
top left one), otherwise invalid. Of the four alternatives above, only the first is valid. Please take into account
that in MuPDF’s coordinate system, the y-axis is oriented from top to bottom. Invalid rectangles were
called infinite in earlier versions.
• A rectangle is called empty if x0 >= x1 or y0 >= y1. This implies that invalid rectangles are also always
empty. And width (resp. height) is set to zero if x0 > x1 (resp. y0 > y1). In previous versions, a
rectangle was empty only if one of width or height was zero.
• Rectangle coordinates cannot be outside the number range from FZ_MIN_INF_RECT = -2147483648 to
FZ_MAX_INF_RECT = 2147483520. Both values have been chosen, because they are the smallest / largest
32bit integers that survive C float conversion roundtrips. In previous versions there was no limit for coordinate
values.
• There is exactly one “infinite” rectangle, defined by x0 = y0 = FZ_MIN_INF_RECT and x1 = y1 =
FZ_MAX_INF_RECT. It contains every other rectangle. It is mainly used for technical purposes – e.g. when a
function call should ignore a formally required rectangle argument. This rectangle is not empty.
• Rectangles are (semi-) open: The right and the bottom edges (including the resp. corners) are not considered
part of the rectangle. This implies that only the top-left corner (x0, y0) can ever belong to the rectangle -
the other three corners never do. An empty rectangle contains no corners at all.
• There are new top level functions defining infinite and standard empty rectangles and quads, see
INFINITE_RECT() and friends.
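The classification above translates directly into a few predicates. The following plain-Python sketch works on 4-tuples (x0, y0, x1, y1); the function names are hypothetical and merely restate the definitions given in the list.

```python
FZ_MIN_INF_RECT = -2147483648
FZ_MAX_INF_RECT = 2147483520
INFINITE = (FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, FZ_MAX_INF_RECT)


def is_valid(rect):
    x0, y0, x1, y1 = rect
    return x0 <= x1 and y0 <= y1


def is_empty(rect):
    x0, y0, x1, y1 = rect
    return x0 >= x1 or y0 >= y1


def is_infinite(rect):
    return tuple(rect) == INFINITE


def width(rect):
    x0, _, x1, _ = rect
    return max(0, x1 - x0)   # zero if x0 > x1
```

Note that a rectangle can be valid and empty at the same time, for example when it has zero width.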
Class API
class Rect
__init__(self )
__init__(self, x0, y0, x1, y1)
__init__(self, top_left, bottom_right)
__init__(self, top_left, x1, y1)
__init__(self, x0, y0, bottom_right)
__init__(self, rect)
__init__(self, sequence)
Overloaded constructors: top_left, bottom_right stand for point_like objects, “sequence” is a Python
sequence type of 4 numbers (see Using Python Sequences as Arguments in PyMuPDF), “rect” means
another rect_like, while the other parameters mean coordinates.
If “rect” is specified, the constructor creates a new copy of it.
Without parameters, the empty rectangle Rect(0.0, 0.0, 0.0, 0.0) is created.
round()
Creates the smallest containing IRect. This is not the same as simply rounding the rectangle’s edges: The
top left corner is rounded upwards and to the left while the bottom right corner is rounded downwards and
to the right.
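The rounding rule can be expressed with floor and ceil. This is a plain-Python sketch with a hypothetical function name; it ignores any small tolerance that the library may apply to coordinates lying very close to an integer.

```python
import math


def round_rect(rect):
    """Smallest integer rectangle that contains rect."""
    x0, y0, x1, y1 = rect
    # top-left moves up and left (floor), bottom-right moves down and right (ceil);
    # remember that the y-axis points downwards
    return (math.floor(x0), math.floor(y0), math.ceil(x1), math.ceil(y1))
```

Simple rounding of (0.4, 1.6, 10.2, 20.0) would give (0, 2, 10, 20), which no longer contains the original rectangle, whereas the function above returns (0, 1, 11, 20).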
transform(m)
Transforms the rectangle with a matrix and replaces the original. If the rectangle is empty or infinite,
this is a no-operation.
Parameters m (Matrix) – The matrix for the transformation.
Return type Rect
Returns the smallest rectangle that contains the transformed original.
intersect(r)
The intersection (common rectangular area, largest rectangle contained in both) of the current rectangle
and r is calculated and replaces the current rectangle. If either rectangle is empty, the result is also
empty. If r is infinite, this is a no-operation. If the rectangles are (mathematically) disjoint sets, then the
result is invalid. If the result is valid but empty, then the rectangles touch each other in a corner or (part
of) a side.
Parameters r (Rect) – Second rectangle
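The three outcomes (overlap, touching, disjoint) can be reproduced with a simple coordinate-wise intersection. This plain-Python sketch uses hypothetical function names and deliberately leaves out the special handling of empty and infinite operands described above.

```python
def intersect(a, b):
    """Coordinate-wise intersection of two rectangles given as 4-tuples."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def is_valid(rect):
    return rect[0] <= rect[2] and rect[1] <= rect[3]


def is_empty(rect):
    return rect[0] >= rect[2] or rect[1] >= rect[3]


base = (0, 0, 10, 10)
overlap = intersect(base, (5, 5, 20, 20))       # common area
touching = intersect(base, (10, 0, 20, 10))     # shares one side only
disjoint = intersect(base, (20, 20, 30, 30))    # nothing in common
```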
include_rect(r)
The smallest rectangle containing the current one and r is calculated and replaces the current one. If
either rectangle is infinite, the result is also infinite. If one is empty, the other one will be taken as the
result.
Parameters r (Rect) – Second rectangle
include_point(p)
The smallest rectangle containing the current one and point p is calculated and replaces the current one.
The infinite rectangle remains unchanged. To create a rectangle containing a series of points, start with
(the empty) fitz.Rect(p1, p1) and successively include the remaining points.
Parameters p (Point) – Point to include.
get_area([unit ])
Calculate the area of the rectangle. With no parameter, the result equals abs(rect). Like an empty rectangle, the
area of an infinite rectangle is also zero. So, at least one of fitz.Rect(p1, p2) and fitz.Rect(p2, p1) has a
zero area.
Parameters unit (str) – Specify required unit: respective squares of px (pixels, default),
in (inches), cm (centimeters), or mm (millimeters).
Return type float
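The unit handling can be sketched in plain Python. This is an illustration only, not the library's code: the function name area and the table LENGTH_PER_PX are hypothetical, and the sketch assumes that one pixel corresponds to 1/72 inch.

```python
# length of one pixel in each unit, assuming 72 pixels per inch
LENGTH_PER_PX = {"px": 1.0, "in": 1.0 / 72.0, "cm": 2.54 / 72.0, "mm": 25.4 / 72.0}

def area(x0, y0, x1, y1, unit="px"):
    """Area of a rectangle in squares of the given unit."""
    width = max(x1 - x0, 0.0)    # invalid rectangles have zero width ...
    height = max(y1 - y0, 0.0)   # ... or zero height, hence zero area
    return width * height * LENGTH_PER_PX[unit] ** 2

a4 = (0, 0, 595, 842)            # hypothetical page rectangle
print(area(*a4))
print(round(area(*a4, unit="cm"), 1))
```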
contains(x)
Checks whether x is contained in the rectangle. It may be an IRect, Rect, Point or number. If x is an empty
rectangle, this is always True. If the rectangle is empty, this is always False for all non-empty rectangles
and for all points. x in rect and rect.contains(x) are equivalent.
Parameters x (rect_like or point_like) – the object to check.
Return type bool
intersects(r)
Checks whether the rectangle and a rect_like “r” contain a common non-empty Rect. This will always
be False if either is infinite or empty.
Parameters r (rect_like) – the rectangle to check.
Return type bool
torect(rect)
(New in version 1.19.3)
Compute the matrix which transforms this rectangle to a given one.
Parameters rect (rect_like) – the target rectangle. Must not be empty or infinite.
Return type Matrix
Returns
a matrix mat such that self * mat = rect. Can for example be used to transform
between the page and the pixmap coordinates.
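To make the property self * mat = rect concrete, here is a plain-Python sketch of such a matrix. It is not the library's implementation; the names rect_to_rect_matrix and apply are hypothetical, rectangles are plain 4-tuples, and the matrix is written as the six numbers (a, b, c, d, e, f) applied to a point as a row vector.

```python
def rect_to_rect_matrix(src, dst):
    """Return (a, b, c, d, e, f) mapping rectangle src onto rectangle dst.

    Rectangles are (x0, y0, x1, y1) tuples; src must have non-zero size.
    """
    sx0, sy0, sx1, sy1 = src
    dx0, dy0, dx1, dy1 = dst
    a = (dx1 - dx0) / (sx1 - sx0)   # horizontal scale
    d = (dy1 - dy0) / (sy1 - sy0)   # vertical scale
    e = dx0 - sx0 * a               # horizontal shift
    f = dy0 - sy0 * d               # vertical shift
    return (a, 0.0, 0.0, d, e, f)

def apply(matrix, point):
    """Transform a point: (x, y) * matrix."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)

page_rect = (0, 0, 595, 842)     # hypothetical page rectangle
pix_rect = (0, 0, 1190, 1684)    # hypothetical pixmap rectangle (zoom 2)
mat = rect_to_rect_matrix(page_rect, pix_rect)
print(apply(mat, (595, 842)))
```

Applying the matrix to the corners of the source rectangle yields the corners of the target rectangle, which is the property needed to move between page and pixmap coordinates.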
Note: Suppose you want to check whether any occurrence of the word “pixmap” is invisible
because the text color equals the ambient color – e.g. white on white. You can make a
pixmap, map each word rectangle to pixmap coordinates with this matrix, and check the
“color environment” of each word there.
morph(fixpoint, matrix)
(New in version 1.17.0)
Return a new quad after applying a matrix to the rectangle using the fixed point fixpoint.
Parameters
• fixpoint (point_like) – the fixed point.
• matrix (matrix_like) – the matrix.
Returns a new Quad. This is a wrapper for the same-named quad method. If infinite, the
infinite quad is returned.
norm()
(New in version 1.16.0)
Return the Euclidean norm of the rectangle treated as a vector of four numbers.
normalize()
Replace the rectangle with its valid version. This is done by shuffling the rectangle corners. After
completion of this method, the bottom right corner will indeed be south-east of the top left one (but
the rectangle may still be empty).
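The corner shuffling can be pictured as follows. This is a plain-Python sketch under the assumption that coordinates are simply swapped per axis; the function name normalized is hypothetical and the code is not taken from the library.

```python
def normalized(x0, y0, x1, y1):
    """Swap coordinates so that x0 <= x1 and y0 <= y1."""
    if x1 < x0:
        x0, x1 = x1, x0
    if y1 < y0:
        y0, y1 = y1, y0
    return (x0, y0, x1, y1)

print(normalized(10, 50, 0, 20))
print(normalized(5, 5, 5, 5))   # valid after the call, but still empty
```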
irect
Equals result of method round().
top_left
tl
Equals Point(x0, y0).
Type Point
top_right
tr
Equals Point(x1, y0).
Type Point
bottom_left
bl
Equals Point(x0, y1).
Type Point
bottom_right
br
Equals Point(x1, y1).
Type Point
quad
The quadrilateral Quad(rect.tl, rect.tr, rect.bl, rect.br).
Type Quad
width
Width of the rectangle. Equals max(x1 - x0, 0).
Return type float
height
Height of the rectangle. Equals max(y1 - y0, 0).
Return type float
x0
X-coordinate of the left corners.
Type float
y0
Y-coordinate of the top corners.
Type float
x1
X-coordinate of the right corners.
Type float
y1
Y-coordinate of the bottom corners.
Type float
is_infinite
True if this is the infinite rectangle.
Type bool
is_empty
True if rectangle is empty.
Type bool
is_valid
True if rectangle is valid.
Type bool
Note:
• This class adheres to the Python sequence protocol, so components can be accessed via their index, too. Also
refer to Using Python Sequences as Arguments in PyMuPDF.
• Rectangles can be used with arithmetic operators – see chapter Operator Algebra for Geometry Objects.
6.17 Shape
This class allows creating interconnected graphical elements on a PDF page. Its methods have the same meaning and
name as the corresponding Page methods.
In fact, each Page draw method is just a convenience wrapper for (1) one shape draw method, (2) the finish()
method, and (3) the commit() method. For page text insertion, only the commit() method is invoked. If many
draw and text operations are executed for a page, you should always consider using a Shape object.
Several draw methods can be executed in a row and each one of them will contribute to one drawing. Once the drawing
is complete, the finish() method must be invoked to apply color, dashing, width, morphing and other attributes.
Draw methods of this class (and insert_textbox()) log the area they cover in a rectangle
(Shape.rect). This property can for instance be used to set Page.cropbox_position.
Text insertions insert_text() and insert_textbox() implicitly execute a “finish” and therefore only
require commit() to become effective. As a consequence, both include parameters for controlling properties like
colors, etc.
Class API
class Shape
__init__(self, page)
Create a new drawing. When PyMuPDF is imported, the fitz.Page object is given the convenience
method new_shape() to construct a Shape object. During instantiation, a check is made whether the
page is a PDF page; otherwise an exception is raised.
Parameters page (Page) – an existing page of a PDF document.
draw_line(p1, p2)
Draw a line from point_like objects p1 to p2.
Parameters
• p1 (point_like) – starting point
• p2 (point_like) – end point
Return type Point
Returns the end point, p2.
draw_squiggle(p1, p2, breadth=2)
Draw a squiggly (wavy, undulated) line from point_like objects p1 to p2. An integer number of full
wave periods will always be drawn, one period having a length of 4 * breadth. The breadth parameter
will be adjusted as necessary to meet this condition. The drawn line will always turn “left” when leaving
p1 and always join p2 from the “right”.
Parameters
• p1 (point_like) – starting point
• p2 (point_like) – end point
• breadth – the breadth of each wave (default 2); one full period has a length of 4 * breadth.
[Figure: three connected lines forming a closed, filled triangle; little arrows indicate the stroking direction.]
Note: Waves drawn are not trigonometric (sine / cosine). If you need that, have a look at draw-sines.py.
Note: The points do not need to be different – experiment a bit with some of them being equal!
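The adjustment of breadth for draw_squiggle() can be pictured with a small plain-Python calculation. The rounding rule used here (nearest whole number of periods) is an assumption for illustration; the text above only states that an integer number of full periods is drawn. The function name squiggle_layout is hypothetical.

```python
import math

def squiggle_layout(p1, p2, breadth=2.0):
    """Return (periods, adjusted_breadth) for a wavy line from p1 to p2."""
    length = math.dist(p1, p2)
    # assumption: pick the nearest whole number of periods of length 4 * breadth
    periods = round(length / (4 * breadth))
    if periods < 1:
        raise ValueError("points too close for one full wave period")
    # stretch or shrink breadth so that the periods fill the line exactly
    return periods, length / (4 * periods)

periods, breadth = squiggle_layout((0, 0), (90, 0), breadth=2)
print(periods, breadth)
```

With these numbers the requested breadth of 2 is enlarged slightly so that the whole periods cover the distance of 90 exactly.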
draw_oval(tetra)
Draw an “ellipse” inside the given tetragon (quadrilateral). If it is a square, a regular circle is drawn, a
general rectangle will result in an ellipse. If a quadrilateral is used instead, a plethora of shapes can be the
result.
The drawing starts and ends at the middle point of the line bottom-left -> top-left corners in
an anti-clockwise movement.
Parameters tetra (rect_like,quad_like) – rect_like or quad_like.
Changed in version 1.14.5: Quads are now also supported.
Return type Point
Returns the middle point of line rect.bl -> rect.tl, or resp. quad.ll ->
quad.ul. Look at just a few examples here, or at the quad-show?.py scripts in the
PyMuPDF-Utilities repository.
draw_circle(center, radius)
Draw a circle given its center and radius. The drawing starts and ends at point center - (radius,
0) in an anti-clockwise movement. This point is the middle of the enclosing square’s left side.
This is a shortcut for draw_sector(center, start, 360, fullSector=False). To draw
the same circle in a clockwise movement, use -360 as degrees.
Parameters
• center (point_like) – the center of the circle.
• radius (float) – the radius of the circle. Must be positive.
the other end point of the arc. Can be used as the starting point for a following
invocation to create logically connected pie charts.
draw_rect(rect)
Draw a rectangle. The drawing starts and ends at the top-left corner in an anti-clockwise movement.
Parameters rect (rect_like) – where to put the rectangle on the page.
Return type Point
Returns top-left corner of the rectangle.
draw_quad(quad)
Draw a quadrilateral. The drawing starts and ends at the top-left corner (Quad.ul) in an anti-clockwise
movement. It is a shortcut of draw_polyline() with the argument (ul, ll, lr, ur, ul).
Parameters quad (quad_like) – where to put the tetragon on the page.
Return type Point
Returns Quad.ul.
finish(width=1, color=None, fill=None, lineCap=0, lineJoin=0, dashes=None, closePath=True,
even_odd=False, morph=(fixpoint, matrix), stroke_opacity=1, fill_opacity=1, oc=0)
Finish a set of draw*() methods by applying Common Parameters to all of them.
It has no effect on Shape.insert_text() and Shape.insert_textbox().
The method also supports morphing the compound drawing using Point fixpoint and Matrix matrix.
Parameters
• morph (sequence) – morph the text or the compound drawing around some
arbitrary Point fixpoint by applying Matrix matrix to it. This implies that fixpoint
is a fixed point of this operation: it will not change its position. Default is no
morphing (None). The matrix can contain any values in its first 4 components,
matrix.e == matrix.f == 0 must be true, however. This means that any combination
of scaling, shearing, rotating, flipping, etc. is possible, but translations are not.
• stroke_opacity (float) – (new in v1.18.1) set transparency for stroke colors.
Values < 0 or > 1 will be ignored. Default is 1 (opaque).
• fill_opacity (float) – (new in v1.18.1) set transparency for fill colors.
Default is 1 (opaque).
• even_odd (bool) – request the “even-odd rule” for filling operations. Default
is False, so that the “nonzero winding number rule” is used. These rules are
alternative methods to apply the fill color where areas overlap. Only with fairly
complex shapes a different behavior is to be expected with these rules. For an in-
depth explanation, see Adobe PDF References, pp. 137 ff. Here is an example to
demonstrate the difference.
• oc (int) – (new in v1.18.4) the xref number of an OCG or OCMD to make this
drawing conditionally displayable.
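The difference between the two fill rules mentioned for even_odd can be reproduced with a point-in-path test in plain Python. This sketch is independent of PyMuPDF; all names are hypothetical. It uses a five-pointed star drawn as one self-intersecting path: the center is wrapped twice by the path, so the two rules disagree there, while a point inside a tip is filled under both.

```python
import math

def crossing_counts(pt, poly):
    """Return (winding_number, crossings) for a ray from pt towards +x."""
    x, y = pt
    winding = crossings = 0
    for i in range(len(poly)):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % len(poly)]
        # the sign tells on which side of the directed edge the point lies
        cross = (x1 - x0) * (y - y0) - (x - x0) * (y1 - y0)
        if y0 <= y < y1 and cross > 0:      # edge crosses the ray going up
            winding += 1
            crossings += 1
        elif y1 <= y < y0 and cross < 0:    # edge crosses the ray going down
            winding -= 1
            crossings += 1
    return winding, crossings

def filled_nonzero(pt, poly):
    """Nonzero winding number rule."""
    return crossing_counts(pt, poly)[0] != 0

def filled_even_odd(pt, poly):
    """Even-odd rule."""
    return crossing_counts(pt, poly)[1] % 2 == 1

# star: every second corner of a regular pentagon, joined in one path
star = [(math.cos(math.radians(90 + 144 * k)),
         math.sin(math.radians(90 + 144 * k))) for k in range(5)]

center, tip = (0.0, 0.0), (0.0, 0.8)
print(filled_nonzero(center, star), filled_even_odd(center, star))
print(filled_nonzero(tip, star), filled_even_odd(tip, star))
```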
Parameters
• point (point_like) – the bottom-left position of the first character of text
in pixels. It is important to understand, how this works in conjunction with
the rotate parameter. Please have a look at the following picture. The small
red dots indicate the positions of point in each of the four possible cases.
The method will reset attributes Shape.rect, lastPoint, draw_cont, text_cont and
totalcont. Afterwards, the shape object can be reused for the same page.
Parameters overlay (bool) – determine whether to put content in foreground (default)
or background. Relevant only, if the page already has a non-empty contents object.
———- Attributes ———-
doc
For reference only: the page’s document.
Type Document
page
For reference only: the owning page.
Type Page
height
Copy of the page’s height
Type float
width
Copy of the page’s width.
Type float
draw_cont
Accumulated command buffer for draw methods since last finish. Every finish method will append its
commands to Shape.totalcont.
Type str
text_cont
Accumulated text buffer. All text insertions go here. This buffer will be appended to totalcont by
commit(), so that text will never be covered by drawings in the same Shape.
Type str
rect
Rectangle surrounding drawings. This attribute is at your disposal and may be changed at any time.
Its value is set to None when a shape is created or committed. Every draw* method, and Shape.
insert_textbox() update this property (i.e. enlarge the rectangle as needed). Morphing operations,
however (Shape.finish(), Shape.insert_textbox()) are ignored.
A typical use of this attribute would be setting Page.cropbox_position to this value, when you
are creating shapes for later or external use. If you have not manipulated the attribute yourself, it should
reflect a rectangle that contains all drawings so far.
If you have used morphing and need a rectangle containing the morphed objects, use the following code:
>>> # assuming ...
>>> morph = (point, matrix)
>>> # ... recalculate the shape rectangle like so:
>>> shape.rect = (shape.rect - fitz.Rect(point, point)) * ~matrix + fitz.Rect(point, point)
Type Rect
totalcont
Total accumulated command buffer for draws and text insertions. This will be used by Shape.
commit().
Type str
lastPoint
For reference only: the current point of the drawing path. It is None at Shape creation and after each
finish() and commit().
Type Point
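The interplay of the buffers draw_cont, text_cont and totalcont, as described in the attribute list above, can be pictured with a toy model. This is a deliberately simplified sketch with plain strings, not PyMuPDF code; the class ShapeBuffers and the command strings are hypothetical.

```python
class ShapeBuffers:
    """Toy model of the Shape command buffers (strings only)."""

    def __init__(self):
        self.draw_cont = ""    # draw commands since the last finish
        self.text_cont = ""    # all text insertions
        self.totalcont = ""    # what a commit will write
        self.committed = []    # stands in for the page's contents

    def draw(self, command):
        self.draw_cont += command + "\n"

    def insert_text(self, command):
        self.text_cont += command + "\n"

    def finish(self, attributes):
        # the attributes apply to all draws since the previous finish
        self.totalcont += attributes + "\n" + self.draw_cont
        self.draw_cont = ""

    def commit(self):
        # text is appended last, so drawings never cover it
        self.committed.append(self.totalcont + self.text_cont)
        self.draw_cont = self.text_cont = self.totalcont = ""

shape = ShapeBuffers()
shape.draw("line A")
shape.insert_text("text T")   # text may be inserted between draws
shape.draw("line B")
shape.finish("color red")
shape.commit()
print(shape.committed[0])
```

In this model only commit() changes the stand-in for the page contents; everything before it is string manipulation.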
6.17.1 Usage
A drawing object is constructed by shape = page.new_shape(). After this, as many draw, finish and text insertion
methods as required may follow. Each sequence of draws must be finished before the drawing is committed. The
overall coding pattern looks like this:
>>> shape = page.new_shape()
>>> shape.draw1(...)
>>> shape.draw2(...)
>>> ...
>>> shape.finish(width=..., color=..., fill=..., morph=...)
>>> shape.draw3(...)
>>> shape.draw4(...)
>>> ...
>>> shape.finish(width=..., color=..., fill=..., morph=...)
>>> ...
>>> shape.insert_text*
>>> ...
>>> shape.commit()
>>> ....
Note:
1. Each finish() combines the preceding draws into one logical shape, giving it common colors, line width, morph-
ing, etc. If closePath is specified, it will also connect the end point of the last draw with the starting point of the
first one.
2. To successfully create compound graphics, let each draw method use the end point of the previous one as its
starting point. In the above pseudo code, draw2 should hence use the returned Point of draw1 as its starting
point. Failing to do so would automatically start a new path, and finish() may not work as expected (but it won’t
complain either).
3. Text insertions may occur anywhere before the commit (they neither touch Shape.draw_cont nor Shape.
lastPoint). They are appended to Shape.totalcont directly, whereas draws will be appended by Shape.finish.
4. Each commit takes all text insertions and shapes and places them in foreground or background on the page –
thus providing a way to control graphical layers.
5. Only commit will update the page’s contents, the other methods are basically string manipulations.
6.17.2 Examples
2. Create a regular n-edged polygon (fill yellow, red border). We use draw_sector() only to calculate the points on
the circumference, and empty the draw command buffer again before drawing the polygon:
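As an alternative illustration, the corner points of a regular polygon can also be computed directly with trigonometry instead of using draw_sector(). The following plain-Python sketch does only this calculation; the function name regular_polygon and the chosen center, radius and edge count are hypothetical.

```python
import math

def regular_polygon(center, radius, n):
    """Return the n corner points of a regular polygon as (x, y) tuples."""
    cx, cy = center
    step = 2 * math.pi / n
    return [(cx + radius * math.cos(i * step),
             cy + radius * math.sin(i * step)) for i in range(n)]

points = regular_polygon((100, 100), 50, 6)
for p in points:
    print(p)
```

The resulting points could then be handed to a polyline draw method, followed by finish() with the desired color and fill values and closePath=True.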
fontname (str)
In general, there are three options:
1. Use one of the standard PDF Base 14 Fonts. In this case, fontfile must not be specified and “Hel-
vetica” is used if this parameter is omitted, too.
2. Choose a font already in use by the page. Then specify its reference name prefixed with a slash
“/”, see example below.
3. Specify a font file present on your system. In this case choose an arbitrary, but new name for this
parameter (without “/” prefix).
If inserted text should re-use one of the page’s fonts, use its reference name appearing in get_fonts()
like so:
Suppose the font list has the item [1024, 0, ‘Type1’, ‘NimbusMonL-Bold’, ‘R366’], then specify fontname
= “/R366”, fontfile = None to use font NimbusMonL-Bold.
fontfile (str)
File path of a font existing on your computer. If you specify fontfile, make sure you use a fontname not
occurring in the above list. This new font will be embedded in the PDF upon doc.save(). Similar to new
images, a font file will be embedded only once. A table of MD5 codes for the binary font contents is used
to ensure this.
set_simple (bool)
Fonts installed from files are installed as Type0 fonts by default. If you want to use 1-byte characters
only, set this to true. This setting cannot be reverted. Subsequent changes are ignored.
fontsize (float)
Font size of text.
dashes (str)
Causes lines to be drawn dashed. The general format is "[n m] p" of (up to) 3 floats denoting pixel
lengths. n is the dash length, m (optional) is the subsequent gap length, and p (the “phase” - required,
even if 0!) specifies how many pixels should be skipped before the dashing starts. If m is omitted, it
defaults to n.
A continuous line (no dashes) is drawn with "[] 0" or None or "". Examples:
• Specifying "[3 4] 0" means dashes of 3 and gaps of 4 pixels following each other.
• "[3 3] 0" and "[3] 0" do the same thing.
For (the rather complex) details on how to achieve sophisticated dashing effects, see Adobe PDF Refer-
ences, page 217.
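The format of the dashes string can be explored with a small plain-Python interpreter. This is a sketch for simple patterns with one or two numbers in the brackets only, not the library's or a PDF viewer's code; the names parse_dashes and is_inked are hypothetical.

```python
def parse_dashes(dashes):
    """Split a dash string like "[3 4] 0" into (dash, gap, phase).

    Returns None for a continuous line ("[] 0", "" or None).
    """
    if not dashes:
        return None
    array, _, phase = dashes.partition("]")
    numbers = [float(v) for v in array.strip("[ ").split()]
    if not numbers:
        return None
    dash = numbers[0]
    gap = numbers[1] if len(numbers) > 1 else dash   # m defaults to n
    return dash, gap, float(phase)

def is_inked(distance, dashes):
    """True if the line is drawn at this distance from its start."""
    parsed = parse_dashes(dashes)
    if parsed is None:
        return True
    dash, gap, phase = parsed
    return (distance + phase) % (dash + gap) < dash

print(parse_dashes("[3 4] 0"))
print(parse_dashes("[3] 0"))
print([is_inked(d, "[3 4] 0") for d in range(8)])
```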
border_width (float)
Set the border width for text insertions. New in v1.14.9. Relevant only if the render mode argument is
used with a value greater than zero.
render_mode (int)
New in version 1.14.9: Integer in range(8) which controls the text appearance (Shape.
insert_text() and Shape.insert_textbox()). See page 246 in Adobe PDF References.
These methods now also differentiate between fill and stroke colors.
• For default 0, only the text fill color is used to paint the text. For backward compatibility, using the
color parameter instead also works.
• For render mode 1, only the border of each glyph (i.e. text character) is drawn with a thickness
as set in argument border_width. The color chosen in the color argument is taken for this, the fill
parameter is ignored.
• For render mode 2, the glyphs are filled and stroked, using both color parameters and the specified
border width. You can use this value to simulate bold text without using another font: choose the
same value for fill and color and an appropriate value for border_width.
• For render mode 3, the glyphs are neither stroked nor filled: the text becomes invisible.
The following examples use border_width=0.3, together with a fontsize of 15. Stroke color is blue and
fill color is some yellow.
overlay (bool)
Causes the item to appear in foreground (default) or background.
morph (sequence)
Causes “morphing” of either a shape, created by the draw*() methods, or the text inserted by page methods
insert_textbox() / insert_text(). If not None, it must be a pair (fixpoint, matrix), where fixpoint is a Point
and matrix is a Matrix. The matrix can be anything except translations, i.e. matrix.e == matrix.f ==
0 must be true. The point is used as a fixed point for the matrix operation. For example, if matrix is a
rotation or scaling, then fixpoint is its center. Similarly, if matrix is a left-right or up-down flip, then the
mirroring axis will be the vertical, respectively horizontal line going through fixpoint, etc.
Note: Several methods contain checks whether the items to be inserted will actually fit into the page (like
Shape.insert_text(), or Shape.draw_rect()). For the result of a morphing operation there
is however no such guarantee: this is entirely the programmer’s responsibility.
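The role of the fixed point in morphing can be shown with plain arithmetic: shift the point so that fixpoint becomes the origin, apply the matrix, and shift back. The sketch below is not PyMuPDF code and its names are hypothetical. It only illustrates the fixed-point property; the visible orientation of a rotation on an actual page depends on the coordinate system in use.

```python
import math

def morph_point(point, fixpoint, matrix):
    """Apply the components (a, b, c, d) of a matrix around fixpoint."""
    a, b, c, d = matrix
    x = point[0] - fixpoint[0]
    y = point[1] - fixpoint[1]
    return (a * x + c * y + fixpoint[0], b * x + d * y + fixpoint[1])

def rotation(degrees):
    """First four components of a rotation matrix (e and f are zero)."""
    rad = math.radians(degrees)
    return (math.cos(rad), math.sin(rad), -math.sin(rad), math.cos(rad))

fix = (100.0, 100.0)
flip_left_right = (-1, 0, 0, 1)
scale_by_two = (2, 0, 0, 2)

print(morph_point(fix, fix, rotation(30)))                # the fixed point stays put
print(morph_point((120.0, 80.0), fix, flip_left_right))   # mirrored at x = 100
print(morph_point((120.0, 80.0), fix, scale_by_two))      # twice as far from fix
```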
lineJoin (int)
New in version 1.14.15: Controls how line connections look. This may be either a sharp
edge (0), a rounded join (1), or a cut-off edge (2, “bevel”).
closePath (bool)
Causes the end point of a drawing to be automatically connected with the starting point (by a straight
line).
6.18 TextPage
This class represents text and images shown on a document page. All MuPDF document types are supported.
The usual ways to create a textpage are DisplayList.get_textpage() and Page.get_textpage(). Be-
cause there is a limited set of methods in this class, there exist wrappers in Page which are handier to use. The last
column of this table shows these corresponding Page methods.
For a description of what this class is all about, see Appendix 2.
Class API
class TextPage
extractText()
extractTEXT()
Return a string of the page’s complete text. The text is UTF-8 unicode and in the same sequence as
specified at the time of document creation.
Return type str
extractBLOCKS()
Textpage content as a list of text lines grouped by block. Each list item looks like this:
(x0, y0, x1, y1, "lines in the block", block_no, block_type)
The first four entries are the block’s bbox coordinates, block_type is 1 for an image block, 0 for text.
block_no is the block sequence number. Multiple text lines are joined via line breaks.
For an image block, its bbox and a text line with some image meta information is included – not the
image content.
This is a high-speed method with just enough information to output plain text in desired reading sequence.
Return type list
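A typical way to turn such a list into plain text is sketched below in plain Python. The sample tuples are invented for illustration and follow the item layout shown above; sorting by the top-left corner is an assumed approximation of reading sequence, and the function name blocks_to_text is hypothetical.

```python
def blocks_to_text(blocks):
    """Join the text of all text blocks, ordered top-to-bottom, left-to-right."""
    text_blocks = [b for b in blocks if b[6] == 0]   # block_type 0 = text
    text_blocks.sort(key=lambda b: (b[1], b[0]))     # by y0, then x0
    return "\n".join(b[4] for b in text_blocks)

# hypothetical sample data: (x0, y0, x1, y1, text, block_no, block_type)
blocks = [
    (50.0, 200.0, 300.0, 230.0, "second paragraph", 1, 0),
    (50.0, 400.0, 300.0, 500.0, "<image: hypothetical meta info>", 2, 1),
    (50.0, 100.0, 300.0, 130.0, "first paragraph", 0, 0),
]
print(blocks_to_text(blocks))
```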
extractWORDS()
Textpage content as a list of single words with bbox information. An item of this list looks like this:
(x0, y0, x1, y1, "word", block_no, line_no, word_no)
Everything delimited by spaces is treated as a “word”. This is a high-speed method which e.g. allows
extracting text from within given areas or recovering the text reading sequence.
Return type list
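Extracting the text inside a given area from such a word list needs nothing but tuple handling. The following plain-Python sketch uses invented sample data in the item layout shown above; the function name words_in_rect is hypothetical.

```python
def words_in_rect(words, rect):
    """Return the words whose bbox lies completely inside rect, in reading order."""
    rx0, ry0, rx1, ry1 = rect
    inside = [w for w in words
              if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1]
    # sort by block, line and word number (items 5, 6 and 7)
    inside.sort(key=lambda w: (w[5], w[6], w[7]))
    return [w[4] for w in inside]

# hypothetical sample data: (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = [
    (72.0, 100.0, 110.0, 112.0, "world", 0, 0, 1),
    (40.0, 100.0, 68.0, 112.0, "Hello", 0, 0, 0),
    (40.0, 300.0, 90.0, 312.0, "footer", 1, 0, 0),
]
print(words_in_rect(words, (0, 90, 200, 120)))
```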
extractHTML()
Textpage content as a string in HTML format. This version contains complete formatting and positioning
information. Images are included (encoded as base64 strings). You need an HTML package to interpret
the output in Python. Your internet browser should be able to adequately display this information, but see
Controlling Quality of HTML Output.
Return type str
extractDICT()
Textpage content as a Python dictionary. Provides same information detail as HTML. See below for the
structure.
Return type dict
extractJSON()
Textpage content as a JSON string. Created by json.dumps(TextPage.extractDICT()). It is
included for backward compatibility. You will probably only ever use this method for outputting the result
to some file. The method detects binary image data and converts them to base64 encoded strings.
Return type str
extractXHTML()
Textpage content as a string in XHTML format. Text information detail is comparable with
extractTEXT(), but also contains images (base64 encoded). This method makes no attempt to re-
create the original visual appearance.
Return type str
extractXML()
Textpage content as a string in XML format. This contains complete formatting information about every
single character on the page: font, size, line, paragraph, location, color, etc. Contains no images. You
need an XML package to interpret the output in Python.
Example Quad versus Rect: when searching for needle “pymupdf”, then the corresponding entry will
either be the blue rectangle, or, if quads was specified, the quad Quad(ul, ur, ll, lr).
rect
The rectangle associated with the text page. This either equals the rectangle of the creating page or the
clip parameter of Page.get_textpage() and text extraction / searching methods.
Note: The output of text searching and most text extractions is restricted to this rectangle. (X)HTML
and XML output will however always extract the full page.
Please also note, that only bboxes (= rect_like 4-tuples) are returned, whereas a TextPage actually has the full
position information – in Quad format. The reason for this decision is again a memory consideration: a quad_like
needs 488 bytes (3 times the size of a rect_like). Given the mentioned amounts of generated bboxes, returning
quad_like information would have a significant impact.
In the vast majority of cases, we are dealing with horizontal text only, where bboxes provide entirely sufficient
information.
In addition, the full quad information is not lost: it can be recovered as needed for lines, spans, and characters by
using the appropriate function from the following list:
• recover_quad() – the quad of a complete span
• recover_span_quad() – the quad of a character subset of a span
• recover_line_quad() – the quad of a line
• recover_char_quad() – the quad of a character
As mentioned, using these functions is only ever needed if the text is not written horizontally – line["dir"] !=
(1, 0) – and you need the quad for text marker annotations (Page.add_highlight_annot() and friends).
Key      Value
width    width of the clip rectangle (float)
height   height of the clip rectangle (float)
blocks   list of block dictionaries
Block dictionaries come in two different formats for image blocks and for text blocks.
• (Changed in v1.18.0) – new dict key number, the block number.
• (Changed in v1.18.11) – new dict key transform, the image transformation matrix for image blocks.
• (Changed in v1.18.11) – new dict key size, the size of the image in bytes for image blocks.
Image block:
Key          Value
type         1 = image (int)
bbox         image bbox on page (rect_like)
number       block count (int)
ext          image type (str), as file extension, see below
width        original image width (int)
height       original image height (int)
colorspace   colorspace component count (int)
xres         resolution in x-direction (int)
yres         resolution in y-direction (int)
bpc          bits per component (int)
transform    matrix transforming image rect to bbox (matrix_like)
size         size of the image in bytes (int)
image        image content (bytes)
Possible values of the “ext” key are “bmp”, “gif”, “jpeg”, “jpx” (JPEG 2000), “jxr” (JPEG XR), “png”, “pnm”, and
“tiff”.
Note:
1. An image block is generated for all and every image occurrence on the page. Hence there may be duplicates,
if an image is shown at different locations.
2. TextPage and corresponding method Page.get_text() are available for all document types. Only for PDF
documents, methods Document.get_page_images() / Page.get_images() offer some overlapping
functionality as far as image lists are concerned. But both lists may or may not contain the same items. Any
differences are most probably caused by one of the following:
• “Inline” images (see page 214 of the Adobe PDF References) of a PDF page are contained in a textpage,
but do not appear in Page.get_images().
• Annotations may also contain images – these will not appear in Page.get_images().
• Image blocks in a textpage are generated for every image location – whether or not there are any dupli-
cates. This is in contrast to Page.get_images(), which will list each image only once (per reference
name).
• Images mentioned in the page’s object definition will always appear in Page.get_images() [1]. But
it may happen that there is no “display” command in the page’s contents (erroneously or on purpose).
In this case the image will not appear in the textpage.
3. The image’s “transformation matrix” is defined as the matrix, for which the expression bbox / transform
== fitz.Rect(0, 0, 1, 1) is true. For details, see Image Transformation Matrix.
[1] Image specifications for a PDF page are done in a page’s (sub-) dictionary, called “/Resources”. Resource dictionaries can be inherited
from the page’s parent object (usually the catalog). The PDF creator may e.g. define one /Resources on file level, naming all images and all
fonts ever used by any page. In these cases, Page.get_images() and Page.get_fonts() will return the same lists for all pages.
Text block:
Key      Value
type     0 = text (int)
bbox     block rectangle, rect_like
number   block count (int)
lines    list of text line dictionaries
Line dictionary:
Key      Value
bbox     line rectangle, rect_like
wmode    writing mode (int): 0 = horizontal, 1 = vertical
dir      writing direction, point_like
spans    list of span dictionaries
The value of key “dir” is the unit vector dir = (cosine, sine) of the angle, which the text has relative to the
x-axis. See the following picture: The word in each quadrant (counter-clockwise from top-right to bottom-right) is
rotated by 30, 120, 210 and 300 degrees respectively.
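Since “dir” is a (cosine, sine) pair, the angle can be recovered with atan2. The following plain-Python sketch only inverts that pair; the function name text_angle is hypothetical, and whether a positive sine means upward or downward on the page depends on the coordinate system in use.

```python
import math

def text_angle(direction):
    """Return the angle in degrees (0 <= angle < 360) for dir = (cosine, sine)."""
    cosine, sine = direction
    return math.degrees(math.atan2(sine, cosine)) % 360

horizontal = (1, 0)
tilted = (math.cos(math.radians(30)), math.sin(math.radians(30)))
print(text_angle(horizontal))
print(round(text_angle(tilted), 6))
```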
Spans contain the actual text. A line contains more than one span only, if it contains text with different font properties.
(Changed in version 1.14.17) Spans now also have a bbox key (again). (Changed in version 1.17.6) Spans now also
have an origin key.
Key         Value
bbox        span rectangle, rect_like
origin      the first character’s origin, point_like
font        font name (str)
ascender    ascender of the font (float)
descender   descender of the font (float)
size        font size (float)
flags       font characteristics (int)
color       text color in sRGB format (int)
text        (only for extractDICT()) text (str)
chars       (only for extractRAWDICT()) list of character dictionaries
(New in version 1.16.0): “color” is the text color encoded in sRGB (int) format, e.g. 0xFF0000 for red. There are
functions for converting this integer back to formats (r, g, b) (PDF with float values from 0 to 1) sRGB_to_pdf(),
or (R, G, B), sRGB_to_rgb() (with integer values from 0 to 255).
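What these conversions amount to can be written down in a few lines of plain Python. The lower-case functions below are hypothetical re-implementations for illustration, not the library's functions.

```python
def srgb_to_rgb(srgb):
    """Split an integer 0xRRGGBB into integer components (R, G, B)."""
    return ((srgb >> 16) & 0xFF, (srgb >> 8) & 0xFF, srgb & 0xFF)

def srgb_to_pdf(srgb):
    """Same components as floats from 0 to 1."""
    return tuple(c / 255 for c in srgb_to_rgb(srgb))

red = 0xFF0000
print(srgb_to_rgb(red))
print(srgb_to_pdf(red))
```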
(New in v1.18.5): “ascender” and “descender” are font properties, provided relative to fontsize 1. Note that descender
is a negative value. The following picture shows the relationship to other values and properties.
These numbers may be used to compute the minimum height of a character (or span) – as opposed to the standard
height provided in the “bbox” values (which actually represents the line height). The following code recalculates the
span bbox to have a height of fontsize exactly fitting the text inside:
>>> a = span["ascender"]
>>> d = span["descender"]
>>> r = fitz.Rect(span["bbox"])
>>> o = fitz.Point(span["origin"]) # its y-value is the baseline
>>> r.y1 = o.y - span["size"] * d / (a - d)
>>> r.y0 = r.y1 - span["size"]
>>> # r now is a rectangle of height 'fontsize'
Caution: The above calculation may deliver a larger height! This may e.g. happen for OCRed documents,
where the risk of all sorts of text artifacts is high. MuPDF tries to come up with a reasonable bbox height,
independently from the fontsize found in the PDF. So please ensure that the height of span["bbox"] is larger
than span["size"].
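The calculation and the caution can be combined in one helper. The sketch below works on a plain dictionary with the span keys listed above and needs no PyMuPDF; the function name tight_span_bbox and the sample values are hypothetical.

```python
def tight_span_bbox(span):
    """Return (x0, y0, x1, y1) with a height equal to the font size,
    or the original bbox if that would not make it smaller."""
    x0, y0, x1, y1 = span["bbox"]
    size = span["size"]
    if y1 - y0 <= size:            # see the caution: nothing to gain here
        return (x0, y0, x1, y1)
    a, d = span["ascender"], span["descender"]
    baseline = span["origin"][1]   # y-value of the origin is the baseline
    new_y1 = baseline - size * d / (a - d)
    return (x0, new_y1 - size, x1, new_y1)

# hypothetical span: line height 16, font size 10
span = {"bbox": (50.0, 88.0, 120.0, 104.0), "origin": (50.0, 100.0),
        "size": 10.0, "ascender": 0.75, "descender": -0.25}
print(tight_span_bbox(span))
```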
Note: You may request PyMuPDF to do all of the above automatically by executing fitz.TOOLS.
set_small_glyph_heights(True). This sets a global parameter so that all subsequent text searches and
text extractions are based on reduced glyph heights, where meaningful.
The following shows the original span rectangle in red and the rectangle with re-computed height in blue.
“flags” is an integer, which represents font properties except for the first bit 0. They are to be interpreted like this:
• bit 0: superscripted (2^0) – not a font property, detected by MuPDF code.
• bit 1: italic (2^1)
• bit 2: serifed (2^2)
• bit 3: monospaced (2^3)
• bit 4: bold (2^4)
Test these characteristics like so:
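One possible way to make the flags readable is sketched here in plain Python. The names FLAG_BITS and describe_flags are hypothetical; the bit meanings are those of the list above.

```python
FLAG_BITS = {0: "superscripted", 1: "italic", 2: "serifed", 3: "monospaced", 4: "bold"}

def describe_flags(flags):
    """Return the names of all properties whose bit is set in flags."""
    return [name for bit, name in FLAG_BITS.items() if flags & (1 << bit)]

print(describe_flags(20))   # 20 = 2^2 + 2^4
```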
Bits 1 thru 4 are font properties, i.e. encoded in the font program. Please note, that this information is not necessarily
correct or complete: fonts quite often contain wrong data here.
Character dictionary for extractRAWDICT():
Key      Value
origin   character’s left baseline point, point_like
bbox     character rectangle, rect_like
c        the character (unicode)
This image shows the relationship between a character’s bbox and its quad:
6.19 TextWriter
(New in v1.16.18)
This class represents a MuPDF text object. The basic idea is to decouple (1) text preparation, and (2) text output to
PDF pages.
During preparation, a text writer stores any number of text pieces (“spans”) together with their positions and individ-
ual font information. The output of the writer’s prepared content may happen multiple times to any PDF page with a
compatible page size.
A text writer is an elegant alternative to methods Page.insert_text() and friends:
• Improved text positioning: Choose any point where insertion of text should start. Storing text returns the
“cursor position” after the last character of the span.
• Free font choice: Each text span has its own font and fontsize. This lets you easily switch when composing a
larger text.
• Automatic fallback fonts: If a character is not supported by the chosen font, alternative fonts are automatically
searched. This significantly reduces the risk of seeing unprintable symbols in the output (“TOFUs” – looking
like a small rectangle). PyMuPDF now also comes with the universal font “Droid Sans Fallback Regular”,
which supports all Latin characters (including Cyrillic and Greek), and all CJK characters (Chinese, Japanese,
Korean).
• Cyrillic and Greek Support: The PDF Base 14 Fonts have integrated support of Cyrillic and Greek characters
without specifying encoding. Your text may be a mixture of Latin, Greek and Cyrillic.
• Transparency support: Parameter opacity is supported. This offers a handy way to create watermark-style
text.
• Justified text: Supported for any font – not just simple fonts as in Page.insert_textbox().
• Reusability: A TextWriter object exists independent from PDF pages. It can be written multiple times, either
to the same or to other pages, in the same or in different PDFs, choosing different colors or transparency.
Using this object entails three steps:
1. When created, a TextWriter requires a fixed page rectangle in relation to which it calculates text positions. A
text writer can write to pages of this size only.
2. Store text in the TextWriter using methods TextWriter.append(), TextWriter.appendv() and
TextWriter.fill_textbox() as often as is desired.
3. Output the TextWriter object on some PDF page(s).
Note:
• Starting with version 1.17.0, TextWriters do support text rotation via the morph parameter of TextWriter.
write_text().
• There also exists Page.write_text() which combines one or more TextWriters and jointly writes them to
a given rectangle and with a given rotation angle – much like Page.show_pdf_page().
Class API
class TextWriter
character (this font or the fallback font) will be taken. The fallback font will never
return small caps. For example, this snippet:
>>> tw = fitz.TextWriter(page.rect)
>>> tw.append((50,100), text, font=font, small_caps=True)
>>> tw.write_text(page)
>>> doc.ez_save("x.pdf")
Note: Use these methods as often as is required – there is no technical limit (except memory constraints of your
system). You can also mix append() and text boxes and have multiple of both. Text positioning is exclusively
controlled by the insertion point. Therefore there is no need to adhere to any order. (Changed in v1.18.0:) Raise
an exception for an unsupported font – checked via Font.is_writable.
(invisible).
text_rect
The area currently occupied.
Return type Rect
last_point
The “cursor position” – a Point – after the last written character (its bottom-right).
Return type Point
opacity
The text opacity (modifiable).
Return type float
color
The text color (modifiable).
Return type float,tuple
rect
The page rectangle for which this TextWriter was created. Must not be modified.
Return type Rect
Note: To see some demo scripts dealing with TextWriter, have a look at this repository.
1. Opacity and color apply to all the text in this object.
2. If you need different colors / transparency, you must create a separate TextWriter. Whenever you determine
the color should change, simply append the text to the respective TextWriter using the previously returned
last_point as position for the new text span.
3. Appending items or text boxes can occur in arbitrary order: only the position parameter controls where text
appears.
4. Font and fontsize can freely vary within the same TextWriter. This can be used to let text with different properties
appear on the same displayed line: just specify pos accordingly, and e.g. set it to last_point of the previously
added item.
5. You can use the pos argument of TextWriter.fill_textbox() to set the position of the first text char-
acter. This allows filling the same textbox with contents from different TextWriter objects, thus allowing for
multiple colors, opacities, etc.
6. MuPDF does not support all fonts with this feature, e.g. no Type3 fonts. Starting with v1.18.0 this can be
checked via the font attribute Font.is_writable. This attribute is also checked when using TextWriter
methods.
6.20 Tools
This class is a collection of utility methods and attributes, mainly around memory management. To simplify and speed
up its use, it is automatically instantiated under the name TOOLS when PyMuPDF is imported.
Class API
class Tools
gen_id()
A convenience method returning a unique positive integer which will increase by 1 on every invocation.
Example usages include creating unique keys in databases - its creation should be faster than using times-
tamps by an order of magnitude.
Note: MuPDF has dropped support for this in v1.14.0, so we have re-implemented a similar function
with the following differences:
• It is not part of MuPDF’s global context and not threadsafe (not an issue because we do not support
threads in PyMuPDF anyway).
• It is implemented as int. This means that the maximum number is sys.maxsize. Should this number
ever be exceeded, the counter starts over again at 1.
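The wrap-around behaviour described above can be pictured with a small counter. This is only a sketch: WrappingCounter and its limit parameter are hypothetical names chosen for illustration and do not mirror PyMuPDF's internal implementation.

```python
import sys


class WrappingCounter:
    """Hypothetical stand-in for the counter behind gen_id()."""

    def __init__(self, limit=sys.maxsize):
        self.limit = limit  # largest value handed out before wrapping
        self.value = 0      # last value handed out

    def gen_id(self):
        # Increase by 1 on every call; start over at 1 once the limit is reached.
        self.value = self.value + 1 if self.value < self.limit else 1
        return self.value


counter = WrappingCounter(limit=3)  # tiny limit, just to make the wrap visible
ids = [counter.gen_id() for _ in range(5)]
```

With the default limit of sys.maxsize the wrap is practically never reached; the small limit is used only to make it observable.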
set_annot_stem(stem=None)
(New in v1.18.6)
Set or inquire the prefix for the id of new annotations, fields or links.
Parameters stem (str) – if omitted, the current value is returned, default is “fitz”. Annotations,
fields / widgets and links technically are subtypes of the same type of object
(/Annot) in PDF documents. An /Annot object may be given a unique identifier
within a page. For each of the applicable subtypes, PyMuPDF generates identifiers
“stem-Annn”, “stem-Wnnn” or “stem-Lnnn” respectively. The number “nnn” is used to
enforce the required uniqueness.
Return type str
Returns the current value.
1 This memory area is internally used by MuPDF, and it serves as a cache for objects that have already been read and interpreted, thus improving
performance. The most bulky object types are images and also fonts. When an application starts up the MuPDF library (in our case this happens
as part of import fitz), it must specify a maximum size for this area. PyMuPDF uses the default value (256 MB) to limit memory consumption.
Use the methods here to control or investigate store usage. For example: even after a document has been closed and all related objects have been
deleted, the store usage may still not drop down to zero. So you might want to enforce that before opening another document.
set_small_glyph_heights(on=None)
(New in v1.18.5)
Set or inquire reduced bbox heights in text extract and text search methods.
Parameters on (bool) – if omitted or None, the current setting is returned. For other
values the bool() function is applied to set a global variable. If True, Page.
search_for() and Page.get_text() methods return character, span, line or
block bboxes that have a height of font size. If False (standard setting when PyMuPDF
is imported), bbox height will be based on font properties and normally equal line
height.
Return type bool
Returns True or False.
Note: Text extraction options “xml”, “xhtml” and “html”, which directly wrap MuPDF code, are not
influenced by this.
set_subset_fontnames(on=None)
(New in v1.18.9)
Control suppression of subset fontname tags in text extractions.
Parameters on (bool) – if omitted / None, the current setting is returned. Arguments
evaluating to True or False set a global variable. If True, options “dict”, “json”,
“rawdict” and “rawjson” will return e.g. "NOHSJV+Calibri-Light", otherwise
only "Calibri-Light" (the default). The setting remains in effect until changed
again.
Return type bool
Returns True or False.
Note: Except as mentioned above, no other text extraction variants are influenced by this. This is especially
true for the options “xml”, “xhtml” and “html”, which are based on MuPDF code. They extract the font
name "Calibri-Light", or even just the family name – Calibri in this example.
unset_quad_corrections(on=None)
(New in v1.18.10)
Enable / disable PyMuPDF-specific code that tries to rebuild valid character quads when encountering
nonsense in Page.get_text() text extractions. This code depends on certain font properties (ascen-
der and descender), which do not exist in rare situations and cause segmentation faults when trying to
access them. This method sets a global parameter in PyMuPDF, which suppresses execution of this code.
Parameters on (bool) – if omitted or None, the current setting is returned. For other
values the bool() function is applied to set a global variable. If True, PyMuPDF
will not try to access the resp. font properties and use values ascender=0.8 and
descender=-0.2 instead.
Return type bool
Returns True or False.
image_profile(stream)
(New in v1.16.17) Show important properties of an image provided as a memory area. Its main purpose
is to avoid using other Python packages just to determine basic properties.
Parameters stream (bytes,bytearray) – the image data.
Return type dict
Returns a dictionary with the keys “width”, “height”, “xres”, “yres”, “colorspace” (the col-
orspace.n value, number of colorants), “cs-name” (the colorspace.name value), “bpc”,
“ext” (image type as file extension). The values for these keys are the same as returned
by Document.extract_image(). Please also have a look at resolution.
Note:
• For some “exotic” images (FAX encodings, RAW formats and the like), this method will not
work and return None. You can however still work with such images in PyMuPDF, e.g. by us-
ing Document.extract_image() or create pixmaps via Pixmap(doc, xref). These
methods will automatically convert exotic images to the PNG format before returning results.
• Some examples:
store_shrink(percent)
Reduce the storables cache by a percentage of its current size.
Parameters percent (int) – the percentage of current size to free. If 100+ the store will
be emptied, if zero, nothing will happen. MuPDF’s caching strategy is “least recently
used”, so low-usage elements get deleted first.
Return type int
Returns the new current store size. Depending on the situation, the size reduction may be
larger than the requested percentage.
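Why the reduction can exceed the requested percentage becomes clearer with a toy model of a "least recently used" cache: only whole objects can be evicted. The sketch below is an assumption-laden illustration; TinyStore, add and shrink are hypothetical names, and MuPDF's real store is considerably more involved.

```python
from collections import OrderedDict


class TinyStore:
    """Hypothetical size-tracking cache; least recently used entries go first."""

    def __init__(self):
        self.items = OrderedDict()  # key -> size in bytes, oldest entry first

    def add(self, key, size):
        self.items[key] = size
        self.items.move_to_end(key)  # mark as most recently used

    @property
    def size(self):
        return sum(self.items.values())

    def shrink(self, percent):
        """Free at least `percent` of the current size; return the new size."""
        if percent <= 0:
            return self.size  # zero: nothing happens
        target = self.size * max(0, 100 - percent) / 100
        while self.items and self.size > target:
            self.items.popitem(last=False)  # evict the least recently used entry
        return self.size


store = TinyStore()
for name, nbytes in [("font", 40), ("image-1", 100), ("image-2", 60)]:
    store.add(name, nbytes)
new_size = store.shrink(10)  # 10 % requested, but a whole 40-byte object is dropped
```

In this model the store shrinks from 200 to 160 bytes, i.e. by 20 % although only 10 % was requested.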
show_aa_level()
(New in version 1.16.14) Return the current anti-aliasing values. These values control the rendering
quality of graphics and text elements.
Return type dict
Returns A dictionary with the following initial content: {'graphics': 8, 'text':
8, 'graphics_min_line_width': 0.0}.
set_aa_level(level)
(New in version 1.16.14) Set the new number of bits to use for anti-aliasing. The same value is taken
currently for graphics and text rendering. This might change in a future MuPDF release.
Parameters level (int) – an integer ranging between 0 and 8. Value outside this range
will be silently changed to valid values. The value will remain in effect throughout the
current session or until changed again.
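One plausible reading of "silently changed to valid values" is clamping to the nearest bound. The helper below is a sketch under that assumption; clamp_aa_level is a hypothetical name and not part of PyMuPDF.

```python
def clamp_aa_level(level):
    """Hypothetical helper: force an anti-aliasing level into the range 0..8."""
    # Assumption: out-of-range values are moved to the nearest valid bound.
    return max(0, min(8, int(level)))


levels = [clamp_aa_level(v) for v in (-3, 0, 5, 8, 12)]
```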
reset_mupdf_warnings()
(New in version 1.16.0)
Empty MuPDF warnings message buffer.
mupdf_display_errors(value=None)
(New in version 1.16.8)
Show or set whether MuPDF errors should be displayed.
Parameters value (bool) – if not a bool, the current setting is returned. If true, MuPDF
errors will be shown on sys.stderr, otherwise suppressed. In any case, messages con-
tinue to be stored in the warnings store. Upon import of PyMuPDF this value is True.
Returns True or False
mupdf_warnings(reset=True)
(New in version 1.16.0)
Return all stored MuPDF messages as a string with interspersed line-breaks.
Parameters reset (bool) – (new in version 1.16.7) whether to automatically empty the
store.
fitz_config
A dictionary containing the actual values used for configuring PyMuPDF and MuPDF. Also refer to the
installation chapter. This is an overview of the keys, each of which describes the status of a support aspect.
store_maxsize
Maximum storables cache size in bytes. PyMuPDF is generated with a value of 268’435’456 (256 MB,
the default value), which you should therefore always see here. If this value is zero, then an “unlimited”
growth is permitted.
Return type int
store_size
Current storables cache size in bytes. This value may change (and will usually increase) with every use
of a PyMuPDF function. It will (automatically) decrease only when Tools.store_maxsize is going
to be exceeded: in this case, MuPDF will evict low-usage objects until the value is again in range.
Return type int
6.21 Widget
This class represents a PDF Form field, also called a “widget”. Throughout this documentation, we are using these
terms synonymously. Fields technically are a special case of PDF annotations, which allow users with limited permis-
sions to enter information in a PDF. This is primarily used for filling out forms.
Like annotations, widgets live on PDF pages. Similar to annotations, the first widget on a page is accessible via
Page.first_widget and subsequent widgets can be accessed via the Widget.next property.
(Changed in version 1.16.0) MuPDF no longer treats widgets as a subset of general annotations. Consequently,
Page.first_annot and Annot.next will deliver non-widget annotations exclusively, and be None if only form
fields exist on a page. Vice versa, Page.first_widget and Widget.next will only show widgets. This
design decision is purely internal to MuPDF; technically, links, annotations and fields have a lot in common and also
continue to share the better part of their code within (Py-) MuPDF.
Class API
class Widget
button_states()
New in version 1.18.15
Return the names of On / Off (i.e. selected / clicked or not) states a button field may have.
While the ‘Off’ state usually is also named like so, the ‘On’ state is often given a name
relating to the functional context, for example ‘Yes’, ‘Female’, etc.
This method helps finding out the possible values of field_value in these cases.
returns a dictionary with the names of ‘On’ and ‘Off’ for the normal and the
pressed-down appearance of button widgets. Example:
update()
After any changes to a widget, this method must be used to store them in the PDF (see footnote 1).
reset()
Reset the field’s value to its default – if defined – or remove it. Do not forget to issue update()
afterwards.
next
Point to the next form field on the page. The last widget returns None.
border_color
A list of up to 4 floats defining the field’s border color. Default value is None which causes border style
and border width to be ignored.
1 If you intend to re-access a new or updated field (e.g. for making a pixmap), make sure to reload the page first. Either close and re-open the
border_style
A string defining the line style of the field’s border. See Annot.border. Default is “s” (“Solid”) – a
continuous line. Only the first character (upper or lower case) will be regarded when creating a widget.
border_width
A float defining the width of the border line. Default is 1.
border_dashes
A list/tuple of integers defining the dash properties of the border line. This is only meaningful if
border_style == “D” and border_color is provided.
choice_values
Python sequence of strings defining the valid choices of list boxes and combo boxes. For these widget
types, this property is mandatory and must contain at least two items. Ignored for other types.
field_name
A mandatory string defining the field’s name. No checking for duplicates takes place.
field_label
An optional string containing an “alternate” field name. Typically used for any notes, help on field usage,
etc. Default is the field name.
field_value
The value of the field.
field_flags
An integer defining a large number of properties of a field. Be careful when changing this attribute as this
may change the field type.
field_type
A mandatory integer defining the field type. This is a value in the range of 0 to 6. It cannot be changed
when updating the widget.
field_type_string
A string describing (and derived from) the field type.
fill_color
A list of up to 4 floats defining the field’s background color.
button_caption
The caption string of a button-type field.
is_signed
A bool indicating the signing status of a signature field, else None.
rect
The rectangle containing the field.
text_color
A list of 1, 3 or 4 floats defining the text color. Default value is black ([0, 0, 0]).
text_font
A string defining the font to be used. Default and replacement for invalid values is “Helv”. For valid font
reference names see the table below.
text_fontsize
A float defining the text fontsize. Default value is zero, which causes PDF viewer software to dynamically
choose a size suitable for the annotation’s rectangle and text amount.
text_maxlen
An integer defining the maximum number of text characters. PDF viewers will (should) not accept a
longer text.
text_type
An integer defining acceptable text types (e.g. numeric, date, time, etc.). For reference only for the time
being – will be ignored when creating or updating widgets.
xref
The PDF xref of the widget.
script
(New in version 1.16.12) JavaScript text (unicode) for an action associated with the widget, or None. This
is the only script action supported for button type widgets.
script_stroke
(New in version 1.16.12) JavaScript text (unicode) to be performed when the user types a key-stroke into
a text field or combo box or modifies the selection in a scrollable list box. This action can check the
keystroke for validity and reject or modify it. None if not present.
script_format
(New in version 1.16.12) JavaScript text (unicode) to be performed before the field is formatted to display
its current value. This action can modify the field’s value before formatting. None if not present.
script_change
(New in version 1.16.12) JavaScript text (unicode) to be performed when the field’s value is changed. This
action can check the new value for validity. None if not present.
script_calc
(New in version 1.16.12) JavaScript text (unicode) to be performed to recalculate the value of this field
when that of another field changes. None if not present.
Note:
1. For adding or changing one of the above scripts, just put the appropriate JavaScript source code in the
widget attribute. To remove a script, set the respective attribute to None.
2. Button fields only support script. Other script entries will automatically be set to None.
Widgets use their own resources object /DR. A widget resources object must at least contain a /Font object. Widget
fonts are independent from page fonts. We currently support the 14 PDF base fonts using the following fixed reference
names, or any name of an already existing field font. When specifying a text font for new or changed widgets, either
choose one in the first table column (upper and lower case supported), or one of the already existing form fonts. In the
latter case, spelling must exactly match.
To find out already existing field fonts, inspect the list Document.FormFonts.
You are generally free to use any font for every widget. However, we recommend using ZaDb (“ZapfDingbats”)
and fontsize 0 for check boxes: typical viewers will put a correctly sized tickmark in the field’s rectangle, when it is
clicked.
PyMuPDF supports the creation and update of many, but not all widget types.
• text (PDF_WIDGET_TYPE_TEXT)
• push button (PDF_WIDGET_TYPE_BUTTON)
• check box (PDF_WIDGET_TYPE_CHECKBOX)
• combo box (PDF_WIDGET_TYPE_COMBOBOX)
• list box (PDF_WIDGET_TYPE_LISTBOX)
• radio button (PDF_WIDGET_TYPE_RADIOBUTTON): PyMuPDF does not currently support groups of (inter-
connected) buttons, where setting one automatically unsets the other buttons in the group. The widget object
also does not reflect the presence of a button group. Setting or unsetting happens via values True and False
and will always work without affecting other radio buttons.
• signature (PDF_WIDGET_TYPE_SIGNATURE) read only.
Instances of classes Point, IRect, Rect and Matrix are collectively also called “geometry” objects.
They all are special cases of Python sequences, see Using Python Sequences as Arguments in PyMuPDF for more
background.
We have defined operators for these classes that allow dealing with them (almost) like ordinary numbers in terms of
addition, subtraction, multiplication, division, and some others.
This chapter is a synopsis of what is possible.
Oper.       Result
bool(OBJ)   is false exactly if all components of OBJ are zero
abs(OBJ)    the rectangle area – equal to norm(OBJ) for the other types
norm(OBJ)   square root of the component squares (Euclidean norm)
+OBJ        new copy of OBJ
-OBJ        new copy of OBJ with negated components
~m          inverse of matrix “m”, or the null matrix if not invertible
For every geometry object “a” and every number “b”, the operations “a ° b” and “a °= b” are always defined for the
operators +, -, *, /. The respective operation is simply executed for each component of “a”. If the second operand is
not a number, then the following is defined:
Oper.       Result
a+b, a-b    component-wise execution, “b” must be “a-like”.
a*m, a/m    “a” can be a point, rectangle or matrix, but “m” must be matrix_like. “a/m” is treated as
            “a*~m” (see note below for non-invertible matrices). If “a” is a point or a rectangle, then
            “a.transform(m)” is executed. If “a” is a matrix, then matrix concatenation takes place.
a&b         intersection rectangle: “a” must be a rectangle and “b” rect_like. Delivers the largest
            rectangle contained in both operands.
a|b         union rectangle: “a” must be a rectangle, and “b” may be point_like or rect_like. Delivers
            the smallest rectangle containing both operands.
b in a      if “b” is a number, then “b in tuple(a)” is returned. If “b” is point_like or rect_like,
            then “a” must be a rectangle, and “a.contains(b)” is returned.
a == b      True if bool(a-b) is False (“b” may be “a-like”).
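The rectangle operators in the table can be mimicked with plain tuples (x0, y0, x1, y1). This is a minimal sketch to show the arithmetic only; the function names are hypothetical, and the real classes handle more cases (for example non-overlapping or infinite rectangles).

```python
def intersect(a, b):
    """Largest rectangle contained in both a and b (assumes they overlap)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def contains(a, b):
    """True if rectangle b lies completely inside rectangle a."""
    return a[0] <= b[0] and a[1] <= b[1] and b[2] <= a[2] and b[3] <= a[3]


def add(a, b):
    """Component-wise a + b; b is a number or a sequence of the same length."""
    if isinstance(b, (int, float)):
        return tuple(x + b for x in a)
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return tuple(x + y for x, y in zip(a, b))


r1 = (0, 0, 100, 100)
r2 = (50, 50, 200, 150)
common = intersect(r1, r2)  # plays the role of r1 & r2
hull = union(r1, r2)        # plays the role of r1 | r2
shifted = add(r1, 5)        # plays the role of r1 + 5
```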
For the usual arithmetic operations, numbers are always allowed as second operand. In addition, you can formulate “x
in OBJ”, where x is a number. It is implemented as “x in tuple(OBJ)”:
>>> fitz.Rect(1, 2, 3, 4) + 5
fitz.Rect(6.0, 7.0, 8.0, 9.0)
>>> 3 in fitz.Rect(1, 2, 3, 4)
True
>>>
The following will create the upper left quarter of a document page rectangle:
>>> page.rect
Rect(0.0, 0.0, 595.0, 842.0)
>>> page.rect / 2
Rect(0.0, 0.0, 297.5, 421.0)
>>>
The following will deliver the middle point of a line connecting two points p1 and p2:
>>> p1 = fitz.Point(1, 2)
>>> p2 = fitz.Point(4711, 3141)
>>> mp = (p1 + p2) / 2
>>> mp
Point(2356.0, 1571.5)
>>>
The second operand of a binary operation can always be “like” the left operand. “Like” in this context means “a
sequence of numbers of the same length”. With the above examples:
>>> p1 + p2
Point(4712.0, 3143.0)
>>> p1 + (4711, 3141)
Point(4712.0, 3143.0)
>>> p1 += (4711, 3141)
>>> p1
Point(4712.0, 3143.0)
>>>
Points, rectangles and matrices can be transformed with matrices. In PyMuPDF, we treat this like a “multiplication”
(or resp. “division”), where the second operand may be “like” a matrix. Division in this context means “multiplication
with the inverted matrix”:
>>> m = fitz.Matrix(1, 2, 3, 4, 5, 6)
>>> n = fitz.Matrix(6, 5, 4, 3, 2, 1)
>>> p = fitz.Point(1, 2)
>>> p * m
Point(12.0, 16.0)
>>> p * (1, 2, 3, 4, 5, 6)
Point(12.0, 16.0)
>>> p / m
Point(2.0, -2.0)
>>> p / (1, 2, 3, 4, 5, 6)
Point(2.0, -2.0)
>>>
>>> m * n # matrix multiplication
Matrix(14.0, 11.0, 34.0, 27.0, 56.0, 44.0)
>>> m / n # matrix division
Matrix(2.5, -3.5, 3.5, -4.5, 5.5, -7.5)
>>>
>>> m / m # result is equal to the Identity matrix
Matrix(1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
>>>
>>> # look at this non-invertible matrix:
>>> m = fitz.Matrix(1, 0, 1, 0, 1, 0)
>>> ~m
Matrix(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
>>> # we try dividing by it in two ways:
>>> p = fitz.Point(1, 2)
>>> p * ~m # this delivers point (0, 0):
Point(0.0, 0.0)
>>> p / m # but this is an exception:
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
p / m
File "... /site-packages/fitz/fitz.py", line 869, in __truediv__
raise ZeroDivisionError("matrix not invertible")
ZeroDivisionError: matrix not invertible
>>>
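For readers who want to check the numbers above by hand, the arithmetic can be reproduced with plain tuples. The sketch assumes the usual PDF convention for a matrix (a, b, c, d, e, f), where a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f); the function names are hypothetical and are not PyMuPDF's implementation.

```python
def transform(p, m):
    """Apply matrix m = (a, b, c, d, e, f) to point p = (x, y)."""
    x, y = p
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def concat(m, n):
    """Matrix product m * n: apply m first, then n."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def invert(m):
    """Inverse of m; raises ZeroDivisionError if m is not invertible."""
    a, b, c, d, e, f = m
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix not invertible")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    # The inverse translation undoes (e, f) in the already inverted frame.
    return (ia, ib, ic, id_, -(e * ia + f * ic), -(e * ib + f * id_))


m = (1, 2, 3, 4, 5, 6)
n = (6, 5, 4, 3, 2, 1)
p = (1, 2)
product = concat(m, n)             # corresponds to m * n
quotient = concat(m, invert(n))    # corresponds to m / n
moved = transform(p, m)            # corresponds to p * m
restored = transform(p, invert(m)) # corresponds to p / m
```

Here "division" is spelled out as multiplication with the inverse, which is why a non-invertible matrix ends in ZeroDivisionError.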
Contains a number of functions and classes for the experienced user. To be used for special needs or performance
requirements.
8.1 Functions
The following are miscellaneous functions and attributes on a fairly low-level technical detail.
Some functions provide detail access to PDF structures. Others are stripped-down, high performance versions of other
functions which provide more information.
Yet others are handy, general-purpose utilities.
paper_size(s)
Convenience function to return width and height of a known paper format code. These values are
given in pixels for the standard resolution 72 pixels = 1 inch.
Currently defined formats include ‘A0’ through ‘A10’, ‘B0’ through ‘B10’, ‘C0’ through ‘C10’,
‘Card-4x6’, ‘Card-5x7’, ‘Commercial’, ‘Executive’, ‘Invoice’, ‘Ledger’, ‘Legal’, ‘Legal-13’,
‘Letter’, ‘Monarch’ and ‘Tabloid-Extra’, each in either portrait or landscape format.
A format name must be supplied as a string (case-insensitive), optionally suffixed with “-L” (land-
scape) or “-P” (portrait). No suffix defaults to portrait.
Parameters s (str) – any format name from above in upper or lower case, like “A4”
or “letter-l”.
Return type tuple
Returns (width, height) of the paper format. For an unknown format (-1, -
paper_rect(s)
Convenience function to return a Rect for a known paper format.
Parameters s (str) – any format name supported by paper_size().
Return type Rect
Returns fitz.Rect(0, 0, width, height) with width, height=fitz.paper_size(s).
>>> import fitz
>>> fitz.paper_rect("letter-l")
fitz.Rect(0.0, 0.0, 792.0, 612.0)
>>>
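The naming rules (case-insensitive name, optional “-L” / “-P” suffix, portrait by default) can be sketched without the library. Everything here is illustrative: paper_size_sketch and the two-entry PORTRAIT table are hypothetical, and the (-1, -1) result for unknown names is an assumption.

```python
# Portrait sizes in points (1/72 inch); only two formats are listed here.
PORTRAIT = {"a4": (595, 842), "letter": (612, 792)}


def paper_size_sketch(name):
    """Return (width, height) for names like "A4", "a4-p" or "letter-L"."""
    key = name.lower()
    landscape = key.endswith("-l")
    if key.endswith(("-l", "-p")):
        key = key[:-2]  # strip the orientation suffix
    if key not in PORTRAIT:
        return (-1, -1)  # assumption: marker value for unknown formats
    width, height = PORTRAIT[key]
    return (height, width) if landscape else (width, height)


letter_landscape = paper_size_sketch("letter-l")
```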
sRGB_to_pdf(srgb)
New in v1.17.4
Convenience function returning a PDF color triple (red, green, blue) for a given sRGB color integer
as it occurs in Page.get_text() dictionaries “dict” and “rawdict”.
Parameters srgb (int) – an integer of format RRGGBB, where each color component
is an integer in range(256).
Returns a tuple (red, green, blue) with float items in interval 0 <= item <= 1 representing
the same color. Example sRGB_to_pdf(0xff0000) = (1, 0,
0) (red).
sRGB_to_rgb(srgb)
New in v1.17.4
Convenience function returning a color (red, green, blue) for a given sRGB color integer.
Parameters srgb (int) – an integer of format RRGGBB, where each color component
is an integer in range(256).
Returns a tuple (red, green, blue) with integer items in range(256) representing
the same color. Example sRGB_to_rgb(0xff0000) = (255, 0, 0)
(red).
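Both conversions amount to splitting the integer into three bytes. The following sketch shows the bit arithmetic in plain Python; the lower-case function names are hypothetical stand-ins, not the library functions themselves.

```python
def srgb_to_rgb(srgb):
    """Split an integer 0xRRGGBB into (red, green, blue), each in range(256)."""
    return ((srgb >> 16) & 0xFF, (srgb >> 8) & 0xFF, srgb & 0xFF)


def srgb_to_pdf(srgb):
    """Same color as floats in the interval 0 <= item <= 1."""
    return tuple(component / 255 for component in srgb_to_rgb(srgb))


red_int = srgb_to_rgb(0xFF0000)
red_pdf = srgb_to_pdf(0xFF0000)
```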
glyph_name_to_unicode(name)
New in v1.18.0
Return the unicode number of a glyph name based on the Adobe Glyph List.
Parameters name (str) – the name of some glyph. The function is based on the
Adobe Glyph List.
Return type int
Returns the unicode. Invalid name entries return 0xfffd (65533).
unicode_to_glyph_name(ch)
New in v1.18.0
Return the glyph name of a unicode number, based on the Adobe Glyph List.
Parameters ch (int) – the unicode given by e.g. ord("ß"). The function is based
on the Adobe Glyph List.
Return type str
Returns the glyph name. E.g. fitz.unicode_to_glyph_name(ord("Ä"))
returns 'Adieresis'.
adobe_glyph_names()
New in v1.18.0
Return a list of glyph names defined in the Adobe Glyph List.
Return type list
Returns list of strings.
adobe_glyph_unicodes()
New in v1.18.0
Return a list of unicodes for which there exists a glyph name in the Adobe Glyph List.
Return type list
Returns list of integers.
recover_quad(line_dir, span)
New in v1.18.9
Convenience function returning the quadrilateral enveloping the text of a text span, as returned by
Page.get_text() using the “dict” or “rawdict” options.
Parameters
• line_dir (tuple) – the value line["dir"] of the span’s line.
• span (dict) – the span sub-dictionary.
Returns the quadrilateral of the span’s text.
planish_line(p1, p2)
(New in version 1.16.2)
Return a matrix which maps the line from p1 to p2 to the x-axis such that p1 will become (0,0) and
p2 a point with the same distance to (0,0).
Parameters
• p1 (point_like) – starting point of the line.
• p2 (point_like) – end point of the line.
Return type Matrix
Returns
a matrix which combines a rotation and a translation:
>>> p1 = fitz.Point(1, 1)
>>> p2 = fitz.Point(4, 5)
>>> abs(p2 - p1) # distance of points
5.0
>>> m = fitz.planish_line(p1, p2)
>>> p1 * m
Point(0.0, 0.0)
>>> p2 * m
Point(5.0, -5.960464477539063e-08)
>>> # distance of the resulting points
>>> abs(p2 * m - p1 * m)
5.0
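How such a matrix can be built is sketched below with plain tuples: rotate by the negative angle of the line and shift so that p1 lands on the origin. This is an illustration under the PDF matrix convention (a, b, c, d, e, f); planish_line_sketch and transform are hypothetical helpers, not the library code.

```python
import math


def planish_line_sketch(p1, p2):
    """Matrix (a, b, c, d, e, f) mapping p1 to (0, 0) and p2 onto the x-axis."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("p1 and p2 must differ")
    cos, sin = dx / length, dy / length
    # The linear part rotates by the negative line angle;
    # (e, f) then shifts the rotated p1 to (0, 0).
    e = -(cos * p1[0] + sin * p1[1])
    f = sin * p1[0] - cos * p1[1]
    return (cos, -sin, sin, cos, e, f)


def transform(p, m):
    """Apply matrix m to point p: (a*x + c*y + e, b*x + d*y + f)."""
    x, y = p
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


matrix = planish_line_sketch((1, 1), (4, 5))
start = transform((1, 1), matrix)
end = transform((4, 5), matrix)
```

As in the session above, the results are exact only up to floating point rounding, so comparisons should use a tolerance.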
paper_sizes()
A dictionary of pre-defined paper formats. Used as basis for paper_size().
fitz_fontdescriptors
(New in v1.17.5)
A dictionary of usable fonts from repository pymupdf-fonts. Items are keyed by their reserved
fontname and provide information like this:
In [2]: fitz.fitz_fontdescriptors.keys()
Out[2]: dict_keys(['figbo', 'figo', 'figbi', 'figit', 'fimbo', 'fimo',
'spacembo', 'spacembi', 'spacemit', 'spacemo', 'math', 'music', 'symbol1',
'symbol2'])
In [3]: fitz.fitz_fontdescriptors["fimo"]
Out[3]:
{'name': 'Fira Mono Regular',
'size': 125712,
'mono': True,
'bold': False,
'italic': False,
'serif': True,
'glyphs': 1485}
get_pdf_now()
Convenience function to return the current local timestamp in PDF compatible format, e.g.
D:20170501121525-04'00' for local datetime May 1, 2017, 12:15:25 in a timezone 4 hours west-
ward of the UTC meridian.
Return type str
Returns current local PDF timestamp.
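The timestamp layout can be reproduced with the standard library alone. The sketch below formats an arbitrary timezone-aware datetime; pdf_timestamp is a hypothetical helper written for illustration, not the function described above.

```python
from datetime import datetime, timedelta, timezone


def pdf_timestamp(moment):
    """Format a timezone-aware datetime as D:YYYYMMDDHHMMSS+hh'mm'."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("a timezone-aware datetime is required")
    total_minutes = int(offset.total_seconds()) // 60
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return "D:%s%s%02d'%02d'" % (
        moment.strftime("%Y%m%d%H%M%S"), sign, hours, minutes)


four_hours_west = timezone(timedelta(hours=-4))
example = pdf_timestamp(datetime(2017, 5, 1, 12, 15, 25, tzinfo=four_hours_west))
```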
Note: This function will only do the calculation – it won’t insert font nor text.
Note: The Font class offers a similar method, Font.text_length(), which supports Base-14
fonts and any font with a character map (CMap, Type 0 fonts).
Warning: If you use this function to determine the required rectangle width for the (Page or
Shape) insert_textbox methods, be aware that they calculate on a by-character level. Because
of rounding effects, this will mostly lead to a slightly larger number: sum([fitz.get_text_length(c)
for c in text]) > fitz.get_text_length(text). So either (1) do the same, or (2) use something like
fitz.get_text_length(text + “’”) for your calculation.
get_pdf_str(text)
Make a PDF-compatible string: if the text contains code points ord(c) > 255, then it will be con-
verted to UTF-16BE with BOM as a hexadecimal character string enclosed in “<>” brackets like
<feff. . . >. Otherwise, it will return the string enclosed in (round) brackets, replacing any characters
outside the ASCII range with some special code. Also, every “(“, “)” or backslash is escaped with
a backslash.
Parameters text (str) – the object to convert
Return type str
Returns PDF-compatible string enclosed in either () or <>.
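The two encodings can be sketched as follows. This is an approximation of the rules stated above, not the library code: pdf_string_sketch is a hypothetical name, and the octal escape used for characters between 128 and 255 is an assumption about what "some special code" means.

```python
def pdf_string_sketch(text):
    """Encode text as a PDF string, following the rules described above."""
    if any(ord(c) > 255 for c in text):
        # UTF-16BE with byte order mark, written as hex digits inside <>.
        return "<feff" + text.encode("utf-16-be").hex() + ">"
    parts = []
    for c in text:
        if c in "()\\":
            parts.append("\\" + c)           # escape with a backslash
        elif ord(c) > 127:
            parts.append("\\%03o" % ord(c))  # assumption: octal escape
        else:
            parts.append(c)
    return "(" + "".join(parts) + ")"


plain = pdf_string_sketch("a(b)")
wide = pdf_string_sketch("\u20ac")  # the euro sign needs UTF-16BE
```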
image_properties(stream)
(New in version 1.14.14)
Return a number of basic properties for an image.
Key Value
width (int) width in pixels
height (int) height in pixels
colorspace (int) colorspace.n (e.g. 3 = RGB)
bpc (int) bits per component (usually 8)
format (int) image format in range(15)
ext (str) image file extension indicating the format
size (int) length of the image in bytes
Example:
>>> fitz.image_properties(open("img-clip.jpg","rb"))
{'bpc': 8, 'format': 9, 'colorspace': 3, 'height': 325, 'width': 244,
'ext': 'jpeg', 'size': 14161}
>>>
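To give an idea of what determining such properties involves, here is a sketch for a single format, PNG, whose header stores width, height and bit depth at fixed offsets. It uses only the standard library; png_profile and the key "color_type" are hypothetical and do not reflect how MuPDF inspects images.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_profile(stream):
    """Return basic properties of PNG data, or None if it is not a PNG."""
    if (len(stream) < 26 or stream[:8] != PNG_SIGNATURE
            or stream[12:16] != b"IHDR"):
        return None
    # IHDR payload starts at byte 16: width, height, bit depth, color type.
    width, height, bpc, color_type = struct.unpack(">IIBB", stream[16:26])
    return {"width": width, "height": height, "bpc": bpc,
            "color_type": color_type, "ext": "png", "size": len(stream)}


# Hand-built header of a 244 x 325 RGB image (truncated: no CRC, no pixel data).
header = (PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR"
          + struct.pack(">IIBBBBB", 244, 325, 8, 2, 0, 0, 0))
profile = png_profile(header)
```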
ConversionHeader(output="text", filename="UNKNOWN")
Return the header string required to make a valid document out of page text outputs.
Parameters
• output (str) – type of document. Use the same as the output parameter
of get_text().
• filename (str) – optional arbitrary name to use in output types “json”
and “xml”.
Return type str
ConversionTrailer(output)
Return the trailer string required to make a valid document out of page text outputs. See Page.
get_text() for an example.
Parameters output (str) – type of document. Use the same as the output parame-
ter of get_text().
Return type str
Document.delete_object(xref )
PDF only: Delete an object given by its cross reference number.
Parameters xref (int) – the cross reference number. Must be within the docu-
ment’s valid xref range.
Warning: Only use with extreme care: this may make the PDF unreadable.
Document.del_xml_metadata()
Delete an object containing XML-based metadata from the PDF. (Py-) MuPDF does not support
XML-based metadata. Use this if you want to make sure that the conventional metadata dictionary
will be used exclusively. Many third-party PDF programs insert their own metadata in XML format
and thus may override what you store in the conventional dictionary. This method deletes any such
reference, and the corresponding PDF object will be deleted during next garbage collection of the
file.
Document.xml_metadata_xref()
Return the XML-based metadata xref of the PDF if present – also refer to Document.
del_xml_metadata(). You can use it to retrieve the content via Document.
xref_stream() and then work with it using some XML software.
Return type int
Returns xref of PDF file level XML metadata – or 0 if none exists.
Page.run(dev, transform)
Run a page through a device.
Parameters
• dev (Device) – Device, obtained from one of the Device constructors.
• transform (Matrix) – Transformation to apply to the page. Set it to Iden-
tity if no transformation is desired.
Page.get_bboxlog()
• New in v1.19.0
Returns
a list of rectangles that envelop text, image or drawing objects. Each item is a
tuple (type, (x0, y0, x1, y1)) where the second tuple consists of rectangle coordi-
nates, and type is one of the following values:
• "fill-text" – normal text (painted without character borders)
• "stroke-text" – text showing character borders only
• "ignore-text" – text that should not be displayed (e.g. as used by OCR
text layers)
• "fill-path" – drawing with fill color (and no border)
• "stroke-path" – drawing with border (and no fill color)
• "fill-image" – displays an image
• "fill-shade" – display a shading
The item sequence represents the sequence in which these commands are exe-
cuted to build the page’s appearance. Therefore, if an item’s bbox intersects or
contains that of a previous item, then the previous item may be (partially) covered
/ hidden.
So this list is useful to detect such situations. An item’s index in this list equals
the value of the “seqno” key you will find in the dictionaries returned by Page.
get_drawings() and Page.get_texttrace().
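As a minimal sketch of how this list can be put to use, the following pure-Python helper reports, for every item, which later items intersect its bbox and may therefore cover it. The names rects_intersect and possibly_covered are illustrative only and not part of PyMuPDF; the input is assumed to have the shape described above.

```python
def rects_intersect(a, b):
    """Return True if rectangles a and b, given as (x0, y0, x1, y1), share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def possibly_covered(bboxlog):
    """Map each item index to the indices of later items that overlap it."""
    result = {}
    for i, (_, bbox) in enumerate(bboxlog):
        later = [
            j
            for j in range(i + 1, len(bboxlog))
            if rects_intersect(bbox, bboxlog[j][1])
        ]
        if later:
            result[i] = later
    return result


# Hypothetical sample data in the (type, (x0, y0, x1, y1)) shape.
sample = [
    ("fill-text", (10, 10, 100, 20)),
    ("fill-path", (0, 0, 50, 50)),        # painted later, overlaps the text
    ("fill-image", (200, 200, 300, 300)),
]
print(possibly_covered(sample))
```

With PyMuPDF, the input would come from page.get_bboxlog(). An overlap only means that the earlier item may be hidden: whether it really is depends on the later item's opacity and shape.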
Page.get_texttrace()
• New in v1.18.16
• Changed in v1.19.0: added key “seqno”.
• Changed in v1.19.1: stroke and fill colors now always are either RGB or GRAY
• Changed in v1.19.3: span and character bboxes are now also correct if dir != (1, 0).
Return low-level text information of the page. The method is available for all document types. The
result is a list of Python dictionaries with the following content:
{
'ascender': 0.83251953125, # font ascender (1)
'bbox': (458.14019775390625, # span bbox x0 (7)
749.4671630859375, # span bbox y0
467.76458740234375, # span bbox x1
757.5071411132812), # span bbox y1
'bidi': 0, # bidirectional level (1)
'chars': ( # char information, tuple[tuple]
(45, # unicode (4)
16, # glyph id (font dependent)
(458.14019775390625, # origin.x (1)
755.3758544921875), # origin.y (1)
(458.14019775390625, # char bbox x0 (6)
749.4671630859375, # char bbox y0
462.9649963378906, # char bbox x1
757.5071411132812)), # char bbox y1
( ... ), # more characters
),
'color': (0.0,), # text color, tuple[float] (1)
'colorspace': 1, # number of colorspace components (1)
Details:
1. Information above tagged with “(1)” has the same meaning and value as explained in
TextPage.
• Please note that the font flags value will never contain a superscript flag bit: the
detection of superscripts is done within MuPDF TextPage code – it is not a property of
any font.
• Also note, that the text color is encoded as the usual tuple of floats 0 <= f <= 1 – not
in sRGB format. Depending on span["type"], interpret this as fill color or stroke
color.
2. There are 3 text span types:
• 0: Filled text – equivalent to PDF text rendering mode 0 (0 Tr, the default in PDF),
only each character’s “inside” is shown.
• 1: Stroked text – equivalent to 1 Tr, only the character borders are shown.
• 3: Ignored text – equivalent to 3 Tr (hidden text).
3. Line width in this context is important only for processing span["type"] != 0: it
determines the thickness of the character’s border line. This value may not be provided at all
with the text data. In this case, a value of 5% of the fontsize (span["size"] * 0.05) is
generated. Often, an “artificial” bold text in PDF is created by 2 Tr. There is no equivalent
span type for this case. Instead, the respective text is represented by two consecutive spans –
which are identical in every aspect, except for their types, which are 0 and 1, respectively. It is your
responsibility to handle this type of situation – in Page.get_text(), MuPDF is doing
this for you.
4. For data compactness, the character’s unicode is provided here. Use built-in function chr()
for the character itself.
5. The alpha / opacity value of the span’s text, 0 <= opacity <= 1: 0 is invisible text, 1
(100%) is fully opaque. Depending on span["type"], interpret this value as fill opacity
or stroke opacity, respectively.
6. (Changed in v1.19.0) This value is equal or close to char["bbox"] of “rawdict”. In
particular, the bbox height value is always computed as if “small glyph heights” had been
requested.
7. (New in v1.19.0) This is the union of all character bboxes.
8. (New in v1.19.0) Enumerates the commands that build up the page’s appearance. Can be
used to find out whether text is effectively hidden by objects which are painted “later”, or
over some object. So if there is a drawing or image with a higher sequence number, whose
bbox overlaps (parts of) this text span, one may assume that such an object hides the respective
text. Different text spans have identical sequence numbers if they were created in one go.
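Notes 3 and 4 can be illustrated with a short sketch. It rebuilds a span's text with chr() and drops the stroked twin of a filled span as produced by “2 Tr”. The helper names are hypothetical, and the choice of the keys "font", "size" and "chars" for the comparison is an assumption made for this example, not a rule taken from PyMuPDF.

```python
def span_text(span):
    """Build the span's text from the unicode numbers in span["chars"]."""
    return "".join(chr(char[0]) for char in span["chars"])


def drop_stroke_twins(spans):
    """Skip a stroked span (type 1) that repeats the preceding filled span (type 0)."""
    kept = []
    for span in spans:
        prev = kept[-1] if kept else None
        is_twin = (
            prev is not None
            and prev["type"] == 0
            and span["type"] == 1
            # Assumption: comparing these keys is enough to spot the twin.
            and all(prev[k] == span[k] for k in ("font", "size", "chars"))
        )
        if not is_twin:
            kept.append(span)
    return kept


# Hypothetical spans, reduced to the keys used in this sketch.
chars = (
    (72, 43, (10.0, 20.0), (10.0, 10.0, 17.0, 22.0)),   # "H"
    (105, 76, (17.0, 20.0), (17.0, 10.0, 20.0, 22.0)),  # "i"
)
filled = {"type": 0, "font": "Helvetica", "size": 11.0, "chars": chars}
stroked = dict(filled, type=1)  # same text again, only stroked
spans = drop_stroke_twins([filled, stroked])
print(len(spans), span_text(spans[0]))
```

Real spans may need a stricter or looser comparison, for example also checking the bbox.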
Here is a list of similarities and differences of page.get_texttrace() compared to page.
get_text("rawdict"):
• The method is up to twice as fast, compared to “rawdict” extraction. Depends on the amount
of text.
• The returned data is very much smaller in size – although it provides more information.
• Additional types of text invisibility can be detected: opacity = 0 or type > 1 or overlapping
bbox of an object with a higher sequence number.
• If MuPDF returns unicode 0xFFFD (65533) for unrecognized characters, you may still be
able to deduce desired information from the glyph id.
• The span["chars"] contains no spaces, unless the document creator has explicitly
coded them. They will never be generated like it happens in Page.get_text() methods.
To provide some help for doing your own computations here, the width of a space character
is given. This value is derived from the font where possible. Otherwise the value of a fallback
font is taken.
• There is no effort to organize text like it happens for a TextPage (the hierarchy of blocks,
lines, spans, and characters). Characters are simply extracted in sequence, one by one, and
put in a span. Whenever any of the span’s characteristics changes, a new span is started. So
you may find characters with different origin.y values in the same span (which means
they would appear in different lines). You cannot assume, that span characters are sorted
in any particular order – you must make sense of the info yourself, taking span["dir"],
span["wmode"], etc. into account.
• Ligatures are represented like this:
– MuPDF handles the following ligatures: “fi”, “ff”, “fl”, “ft”, “st”, “ffi”, and “ffl”
(mostly, only the first 3 are ever used). If the page contains e.g. ligature “fi”, you
will find the following two character items next to each other:
(102, glyph, (x, y), (x0, y0, x1, y1)) # 102 = ord("f")
(105, -1, (x, y), (x0, y0, x0, y1)) # 105 = ord("i"), empty bbox!
– This means that the bbox of the first ligature character is the area containing the
complete, compound glyph. Subsequent ligature components are recognizable by
their glyph value -1 and a bbox of width zero.
– You may want to replace those 2 or 3 char tuples by one, that represents the ligature
itself. Use the following mapping of ligatures to unicodes:
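A minimal sketch of such a replacement follows. It assumes the precomposed ligature characters of the Unicode block “Alphabetic Presentation Forms” (U+FB00 to U+FB06). Note that U+FB05 is formally the ligature of long s and t; using it for “ft” is an assumption of this sketch. The function name merge_ligatures is hypothetical.

```python
# Ligature text -> code point of the precomposed ligature character.
LIGATURES = {
    "ff": 0xFB00,
    "fi": 0xFB01,
    "fl": 0xFB02,
    "ffi": 0xFB03,
    "ffl": 0xFB04,
    "ft": 0xFB05,  # assumption: U+FB05 is formally "long s" + "t"
    "st": 0xFB06,
}


def merge_ligatures(chars):
    """Replace ligature component tuples by one tuple for the ligature itself."""
    groups = []
    for item in chars:
        if item[1] == -1 and groups:
            groups[-1].append(item)  # glyph -1: component of the previous char
        else:
            groups.append([item])
    result = []
    for group in groups:
        text = "".join(chr(item[0]) for item in group)
        code = LIGATURES.get(text) if len(group) > 1 else None
        if code is None:
            result.extend(group)  # no (known) ligature: keep items unchanged
        else:
            first = group[0]      # its bbox already covers the compound glyph
            result.append((code, first[1], first[2], first[3]))
    return result


# Hypothetical character items: ligature "fi" followed by "x".
sample = [
    (102, 5, (0, 0), (0, 0, 10, 10)),
    (105, -1, (10, 0), (10, 0, 10, 10)),
    (120, 7, (10, 0), (10, 0, 15, 10)),
]
merged = merge_ligatures(sample)
```

The glyph id, origin and bbox of the first component are kept, because that bbox is the area of the complete glyph.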
Page.wrap_contents()
Put string pair “q” / “Q” before, resp. after a page’s /Contents object(s) to ensure that any “geome-
try” changes are local only.
Page.is_wrapped
Indicate whether Page.wrap_contents() may be required for object insertions in standard
PDF geometry. Note that this is a quick, basic check only: a value of False may still be a false
alarm. But nevertheless executing Page.wrap_contents() will have no negative side effects.
Return type bool
Page.get_text_blocks(flags=None)
Deprecated wrapper for TextPage.extractBLOCKS(). Use Page.get_text() with the
“blocks” option instead.
Return type list[tuple]
Page.get_text_words(flags=None)
Deprecated wrapper for TextPage.extractWORDS(). Use Page.get_text() with the
“words” option instead.
Return type list[tuple]
Page.get_displaylist()
Run a page through a list device and return its display list.
Return type DisplayList
Returns the display list of the page.
Page.get_contents()
PDF only: Retrieve a list of xref of contents objects of a page. May be empty or contain mul-
tiple integers. If the page is cleaned (Page.clean_contents()), it will be one entry at most.
The “source” of each /Contents object can be individually read by Document.xref_stream()
using an item of this list. Method Page.read_contents() in contrast walks through this list
and concatenates the corresponding sources into one bytes object.
Return type list[int]
Page.set_contents(xref )
PDF only: Let the page’s /Contents key point to this xref. Any previously used contents objects
will be ignored and can be removed via garbage collection.
Page.clean_contents(sanitize=True)
(Changed in v1.17.6)
PDF only: Clean and concatenate all contents objects associated with this page. “Cleaning”
includes syntactical corrections, standardizations and “pretty printing” of the contents stream. Dis-
crepancies between contents and resources objects will also be corrected if sanitize is true.
See Page.get_contents() for more details.
Changed in version 1.16.0 Annotations are no longer implicitly cleaned by this method. Use
Annot.clean_contents() separately.
Parameters sanitize (bool) – (new in v1.17.6) if true, resources are synchronized
with their actual use in the contents object. For example,
if a font is not actually used for any text of the page, then it will be deleted from
the /Resources/Font object.
Warning: This is a complex function which may generate large amounts of new data and
render old data unused. Using it together with the incremental save option is not
recommended. Also note that the resulting singleton new /Contents object is uncompressed. So you
should save to a new file using options “deflate=True, garbage=3”.
Page.read_contents()
New in version 1.17.0. Return the concatenation of all contents objects associated with the page
– without cleaning or otherwise modifying them. Use this method whenever you need to parse this
source in its entirety without having to care how many separate contents objects exist.
Return type bytes
Annot.clean_contents(sanitize=True)
Clean the contents streams associated with the annotation. This is the same type of action which
Page.clean_contents() performs – just restricted to this annotation.
Document.get_char_widths(xref=0, limit=256)
Return a list of character glyphs and their widths for a font that is present in the document. A font
must be specified by its PDF cross reference number xref. This function is called automatically
from Page.insert_text() and Page.insert_textbox(). So you should rarely need to
do this yourself.
Parameters
• xref (int) – cross reference number of a font embedded in the PDF. To
find a font xref, use e.g. doc.get_page_fonts(pno) of page number pno and
take the first entry of one of the returned list entries.
• limit (int) – limits the number of returned entries. The default of 256 is
enforced for all fonts that only support 1-byte characters, so-called “simple
fonts” (checked by this method). All PDF Base 14 Fonts are simple fonts.
Return type list
Returns a list of limit tuples. Each character c has an entry (g, w) in this list with
an index of ord(c). Entry g (integer) of the tuple is the glyph id of the character,
and float w is its normalized width. The actual width for some fontsize can be
calculated as w * fontsize. For simple fonts, the g entry can always be safely
ignored. In all other cases g is the basis for graphically representing c.
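The width computation w * fontsize extends naturally to whole strings. The following sketch assumes a list in the shape described above; the helper name text_width and the sample widths are made up for illustration.

```python
def text_width(widths, text, fontsize):
    """Sum the normalized widths of all characters and scale by the fontsize."""
    return sum(widths[ord(c)][1] for c in text) * fontsize


# Hypothetical widths list: entry ord(c) is (glyph id, normalized width).
widths = [(i, 0.5) for i in range(256)]
widths[ord("i")] = (ord("i"), 0.25)

print(text_width(widths, "Hi", 12))  # (0.5 + 0.25) * 12
```

A character whose ord() is not below the length of the list raises an IndexError here, so the limit parameter must be chosen large enough for the text at hand.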
Document.is_stream(xref )
(New in version 1.14.14)
PDF only: Check whether the object represented by xref is a stream type. Return is False if
not a PDF or if the number is outside the valid xref range.
Parameters xref (int) – xref number.
Returns True if the object definition is followed by data wrapped in keyword pair
stream, endstream.
Document.get_new_xref()
Increase the xref by one entry and return that number. This can then be used to insert a new
object.
Return type int
Returns the number of the new xref entry. Please note that only a
new entry in the PDF’s cross reference table is created. At this point, there will
not yet exist a PDF object associated with it. To create an (empty) object with
this number use doc.update_xref(xref, "<<>>").
Document.xref_length()
Return length of xref table.
Return type int
Returns the number of entries in the xref table.
recover_quad(line_dir, span)
Compute the quadrilateral of a text span extracted via options “dict” or “rawdict” of Page.
get_text().
Parameters
• line_dir (tuple) – line["dir"] of the owning line. Use None for
a span from Page.get_texttrace().
• span (dict) – the span.
Returns the Quad of the span, usable for text marker annotations (‘Highlight’, etc.).
recover_line_quad(line, spans=None)
Compute the quadrilateral of a subset of spans of a text line extracted via options “dict” or “rawdict”
of Page.get_text().
Parameters
• line (dict) – the line.
• spans (list) – a sub-list of line["spans"]. If omitted, the full line
quad will be returned.
Returns the Quad of the selected line spans, usable for text marker annotations
(‘Highlight’, etc.).
INFINITE_QUAD()
INFINITE_RECT()
INFINITE_IRECT()
Return the (unique) infinite rectangle Rect(-2147483648.0, -2147483648.0,
2147483520.0, 2147483520.0), resp. the IRect and Quad counterparts. It is the
largest possible rectangle: all valid rectangles are contained in it.
EMPTY_QUAD()
EMPTY_RECT()
EMPTY_IRECT()
Return the “standard” empty and invalid rectangle Rect(2147483520.0, 2147483520.0,
-2147483648.0, -2147483648.0) resp. quad. Its top-left and bottom-right point values
are reversed compared to the infinite rectangle. It will e.g. be used to indicate empty bboxes in
page.get_text("dict") dictionaries. There are however infinitely many empty or invalid
rectangles.
8.2 Device
The different format handlers (pdf, xps, etc.) interpret pages to a “device”. Devices are the basis for everything that
can be done with a page: rendering, text extraction and searching. The device type is determined by the selected
construction method.
Class API
class Device
A DisplayList represents an interpreted document page. Methods for pixmap creation, text extraction and text search
are – behind the curtain – all using the page’s display list to perform their tasks. If a page must be rendered several
times (e.g. because of changed zoom levels), or if text search and text extraction should both be performed, overhead
can be saved, if the display list is created only once and then used for all other tasks.
You can also create display lists for many pages “on stack” (in a list), maybe during document open or during idle
times, or you can store one when a page is visited for the first time (e.g. in GUI scripts).
Note that for everything that follows, only the display list is needed – the corresponding Page object could have been
deleted.
8.3.5.1 Pixmap
The following creates a Pixmap from a DisplayList. Parameters are the same as for Page.get_pixmap().
The execution time of this statement may be up to 50% shorter than that of Page.get_pixmap().
8.3.5.2 TextPage
With the display list from above, we can also search for text.
For this we need to create a TextPage.
With the same TextPage object from above, we can now immediately use any or all of the 5 text extraction methods.
Note: Above, we have created our text page without argument. This leads to a default argument of 3 (ligatures
and white-space are preserved), in other words, images will not be extracted – see below.
If you do not need images extracted alongside the text of a page, you can set the following option:
This will save ca. 25% overall execution time for the HTML, XHTML and JSON text extractions and hugely reduce
the amount of storage (both, memory and disk space) if the document is graphics oriented.
If you however do need images, use a value of 7 for flags:
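The display list workflow described in this section might be sketched as follows. This is an illustration, not the original example code: the function name render_and_search and its parameters are hypothetical. The flag values 3 and 7 are the ones quoted in the text.

```python
# Flag values as quoted above: 3 = ligatures + white-space, 7 = 3 plus images.
TEXT_ONLY = 1 | 2
WITH_IMAGES = TEXT_ONLY | 4


def render_and_search(path, pno, needle, zoom=2.0, flags=TEXT_ONLY):
    """Render a page and search its text, using one shared display list."""
    import fitz  # PyMuPDF, imported here so the constants work without it

    doc = fitz.open(path)
    dl = doc[pno].get_displaylist()        # interpret the page only once
    pix = dl.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    tp = dl.get_textpage(flags)            # text page built from the list
    hits = tp.search(needle)               # locations of the search string
    text = tp.extractText()                # one of the extraction methods
    return pix, hits, text
```

A call like render_and_search("input.pdf", 0, "PyMuPDF") would render the first page at twice its size and search it for the given string. Pass flags=WITH_IMAGES only if images are needed in the extraction output.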
Glossary
matrix_like
A Python sequence of 6 numbers.
rect_like
A Python sequence of 4 numbers.
irect_like
A Python sequence of 4 integers.
point_like
A Python sequence of 2 numbers.
quad_like
A Python sequence of 4 point_like items.
inheritable
A number of values in a PDF can be inherited by objects further down in a parent-child relationship. The mediabox
(physical size) of pages may for example be specified only once or in some node(s) of the pagetree and will
then be taken as value for all kids, that do not specify their own value.
MediaBox
A PDF array of 4 floats specifying a physical page size – (inheritable, mandatory). This rectangle should
contain all other PDF – optional – page rectangles, which may be specified in addition: CropBox, TrimBox,
ArtBox and BleedBox. Please consult Adobe PDF References for details. The MediaBox is the only rectangle,
for which there is no difference between MuPDF and PDF coordinate systems: Page.mediabox will always
show the same coordinates as the /MediaBox key in a page’s object definition. For all other rectangles,
MuPDF transforms coordinates such that the top-left corner is the point of reference. This can sometimes be
confusing – you may for example encounter a situation like this one:
• The page definition contains the following identical values: /MediaBox [ 36 45 607.5 765 ],
/CropBox [ 36 45 607.5 765 ].
• PyMuPDF accordingly shows page.mediabox = Rect(36.0, 45.0, 607.5, 765.0).
• BUT: page.cropbox = Rect(36.0, 0.0, 607.5, 720.0), because the two y-coordinates
have been transformed (45 subtracted from both of them).
CropBox
A PDF array of 4 floats specifying a page’s visible area – (inheritable, optional). It is the default for
TrimBox, ArtBox and BleedBox. If not present, it defaults to MediaBox. This value is not affected if the page
is rotated – in contrast to Page.rect. Also, other than the page rectangle, the top-left corner of the cropbox
may or may not be (0, 0).
catalog
A central PDF dictionary – also called the “root” – containing document-wide parameters and pointers to
many other information. Its xref is returned by Document.pdf_catalog().
trailer
More precisely, the PDF trailer contains information in dictionary format. It is usually located at the
file’s end. In this dictionary, you will find things like the xrefs of the catalog and the metadata, the number of
xref numbers, etc. Here is the definition of the PDF spec:
“The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and
certain special objects. Applications should read a PDF file from its end.”
To access the trailer in PyMuPDF, use the usual methods Document.xref_object(), Document.
xref_get_key() and Document.xref_get_keys() with -1 instead of a positive xref number.
contents
A content stream is a PDF object with an attached stream, whose data consists of a sequence of instruc-
tions describing the graphical elements to be painted on a page, see “Stream Objects” on page 19 of Adobe PDF
References. For an overview of the mini-language used in these streams, see chapter “Operator Summary” on
page 643 of the Adobe PDF References. A PDF page can have none to many contents objects. If it has none,
the page is empty (but still may show annotations). If it has several, they will be interpreted in sequence as if
their instructions had been present in one such object (i.e. like in a concatenated string). It should be noted
that there are more stream object types which use the same syntax: e.g. appearance dictionaries associated with
annotations and Form XObjects.
PyMuPDF provides a number of methods to deal with contents of PDF pages:
• Page.read_contents() – reads and concatenates all page contents into one bytes object.
• Page.clean_contents() – a wrapper of a MuPDF function that reads, concatenates and syntax-
cleans all page contents. After this, only one /Contents object will exist. In addition, page
resources will have been synchronized with it such that it will contain exactly those images, fonts
and other objects that the page actually references.
• Page.get_contents() – return a list of xref numbers of a page’s contents objects. May be
empty. Use Document.xref_stream() with one of these xrefs to read the resp. contents section.
• Page.set_contents() – set a page’s /Contents key to the provided xref number.
resources
A dictionary containing references to any resources (like images or fonts) required by a PDF page (re-
quired, inheritable, Adobe PDF References p. 81) and certain other objects (Form XObjects). This dictionary
appears as a sub-dictionary in the object definition under the key /Resources. Being an inheritable object type,
there may exist “parent” resources for all pages or certain subsets of pages.
dictionary
A PDF object type, which is somewhat comparable to the same-named Python notion: “A dictionary object
is an associative table containing pairs of objects, known as the dictionary’s entries. The first element of each
entry is the key and the second element is the value. The key must be a name (. . . ). The value can be any kind
of object, including another dictionary. A dictionary entry whose value is null (. . . ) is equivalent to an absent
entry.” (Adobe PDF References p. 18).
Dictionaries are the most important object type in PDF. Here is an example (describing a page):
<<
/Contents 40 0 R % value: an indirect object
/Type/Page % value: a name object
/MediaBox[0 0 595.32 841.92] % value: an array object
Contents, Type, MediaBox, etc. are keys, 40 0 R, Page, [0 0 595.32 841.92], etc. are the respective values. The
strings “<<” and “>>” are used to enclose object definitions.
This example also shows the syntax of nested dictionary values: Resources has an object as its value, which in
turn is a dictionary with keys like ExtGState (with the value <</R7 26 0 R>>, which is another dictionary), etc.
page
A PDF page is a dictionary object which defines one page in a PDF, see Adobe PDF References p. 71.
pagetree
The pages of a document are accessed through a structure known as the page tree, which defines the ordering
of pages in the document. The tree structure allows PDF consumer applications, using only limited memory,
to quickly open a document containing thousands of pages. The tree contains nodes of two types: intermediate
nodes, called page tree nodes, and leaf nodes, called page objects. (Adobe PDF References p. 75).
While it is possible to list all page references in just one array, PDFs with many pages are often created using
balanced tree structures (“page trees”) for faster access to any single page. In relation to the total number of
pages, this can reduce the average page access time by page number from a linear to some logarithmic order of
magnitude.
For fast page access, MuPDF can use its own array in memory – independently from what may or may not be
present in the document file. This array is indexed by page number and therefore much faster than even the
access via a perfectly balanced page tree.
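To make the difference between the two access paths concrete, here is a toy model of a page tree built from Python dictionaries. It is only a sketch: the keys mirror the PDF names /Type, /Kids and /Count, while the helpers and the "Name" key are hypothetical.

```python
def find_page(node, index):
    """Return the leaf (page) with the given 0-based index below node."""
    if node["Type"] == "Page":
        if index != 0:
            raise IndexError("page index out of range")
        return node
    for kid in node["Kids"]:
        count = kid.get("Count", 1)  # a leaf counts as one page
        if index < count:
            return find_page(kid, index)
        index -= count               # skip this whole subtree
    raise IndexError("page index out of range")


def page(name):
    """Build a leaf node; "Name" is a hypothetical label for this demo."""
    return {"Type": "Page", "Name": name}


def pages(*kids):
    """Build an intermediate node and compute its page count."""
    count = sum(kid.get("Count", 1) for kid in kids)
    return {"Type": "Pages", "Kids": list(kids), "Count": count}


root = pages(pages(page("p0"), page("p1")), pages(page("p2"), page("p3")))
print(find_page(root, 2)["Name"])

# A flat in-memory array gives the same result with one index operation.
flat = [find_page(root, i) for i in range(root["Count"])]
print(flat[2]["Name"])
```

The tree walk skips whole subtrees by looking at their counts, whereas the flat list needs a single index operation per page.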
object
Similar to Python, PDF supports the notion object, which can come in eight basic types: boolean values,
integer and real numbers, strings, names, arrays, dictionaries, streams, and the null object (Adobe PDF
References p. 13). Objects can be made identifiable by assigning a label. Such a labeled object is then called an indirect object.
PyMuPDF supports retrieving definitions of indirect objects by their cross reference number via Document.
xref_object().
stream
A PDF object type which is followed by a sequence of bytes, similar to a Python string or rather bytes. “How-
ever, a PDF application can read a stream incrementally, while a string must be read in its entirety. Furthermore,
a stream can be of unlimited length, whereas a string is subject to an implementation limit. For this reason, ob-
jects with potentially large amounts of data, such as images and page descriptions, are represented as streams.”
“A stream consists of a dictionary followed by zero or more bytes bracketed between the keywords stream
and endstream”:
nnn 0 obj
<<
dictionary definition
>>
stream
See Adobe PDF References p. 19. PyMuPDF supports retrieving stream content via Document.
xref_stream(). Use Document.is_stream() to determine whether an object is of stream type.
unitvector
A mathematical notion meaning a vector of norm (“length”) 1 – usually the Euclidean norm is implied. In
PyMuPDF, this term is restricted to Point objects, see Point.unit.
xref
Abbreviation for cross-reference number: this is an integer unique identification for objects in a PDF. There
exists a cross-reference table (which may physically consist of several separate segments) in each PDF, which
stores the relative position of each object for quick lookup. The cross-reference table is one entry longer than
the number of existing object: item zero is reserved and must not be used in any way. Many PyMuPDF classes
have an xref attribute (which is zero for non-PDFs), and one can find out the total number of objects in a PDF
via Document.xref_length() - 1.
resolution
Images and Pixmap objects may contain resolution information provided as “dots per inch”, dpi, in each
direction (horizontal and vertical). When MuPDF reads an image from a file or from a PDF object, it will parse
this information and put it in Pixmap.xres, Pixmap.yres, respectively. When it finds no meaningful
information in the input (like non-positive values or values exceeding 4800), it will use “sane” defaults instead.
The usual default value is 96, but it may also be 72 in some cases (e.g. for JPX images).
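The rule just described can be written down in a few lines. This is a sketch of the rule as stated here, not MuPDF's actual implementation; the function name sane_dpi is hypothetical.

```python
def sane_dpi(value, default=96):
    """Return value if it is a plausible dpi figure, else the default."""
    if value <= 0 or value > 4800:
        return default
    return value


print(sane_dpi(300), sane_dpi(0), sane_dpi(5000), sane_dpi(-1, default=72))
```

The same check can be useful in your own code before you rely on Pixmap.xres or Pixmap.yres for size computations.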
OCPD
Optional content properties dictionary - a sub dictionary of the PDF catalog. The central place to store
optional content information, which is identified by the key /OCProperties. This dictionary has two required and
one optional entry: (1) /OCGs, required, an array listing all optional content groups, (2) /D, required, the default
optional content configuration dictionary (OCCD), (3) /Configs, optional, an array of alternative OCCDs.
OCCD
Optional content configuration dictionary - a PDF dictionary inside the PDF OCPD. It stores a setting of ON
/ OFF states of OCGs and how they are presented to a PDF viewer program. Selecting a configuration is a quick
way to achieve temporary mass visibility state changes. After opening a PDF, the /D configuration of the OCPD
is always activated. Viewers should offer a way to switch between /D and one of the optional configurations
contained in array /Configs.
OCG
Optional content group – a dictionary object used to control the visibility of other PDF objects like images
or annotations. Regardless of the page on which they are defined, objects with the same OCG can simultaneously
be shown or hidden by setting their OCG to ON or OFF. This can be achieved via the user interface provided by
many PDF viewers (Adobe Acrobat), or programmatically.
OCMD
Optional content membership dictionary – a dictionary object which can be used like an OCG: it has a
visibility state. The visibility of an OCMD is computed: it is a logical expression, which uses the state of one
or more OCGs to produce a boolean value. The expression’s result is interpreted as ON (true) or OFF (false).
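How such a computed visibility works can be sketched in plain Python. The nested tuple notation below is a hypothetical stand-in for the PDF expression syntax, and the OCG names are invented. Only the idea of combining OCG states with logical operators is taken from the text.

```python
def is_visible(expr, ocg_states):
    """Evaluate a nested visibility expression against OCG ON / OFF states."""
    if isinstance(expr, str):          # a plain OCG name
        return ocg_states[expr]
    op, *operands = expr
    values = [is_visible(operand, ocg_states) for operand in operands]
    if op == "and":
        return all(values)
    if op == "or":
        return any(values)
    if op == "not":
        if len(values) != 1:
            raise ValueError("'not' takes exactly one operand")
        return not values[0]
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical OCG states: True means ON, False means OFF.
states = {"Layer A": True, "Layer B": False}
print(is_visible(("and", "Layer A", ("not", "Layer B")), states))
```

An object that refers to such an expression is shown only while the expression evaluates to ON (true) for the current OCG states.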
ligature
Some frequent character combinations are represented by their own special glyphs in more advanced fonts.
Typical examples are “fi”, “fl”, “ffi” and “ffl”. These compounds are called ligatures. In PyMuPDF text
extractions there is the option to either return the corresponding unicode unchanged, or split ligatures up into their
constituent parts: “fi” ==> “f” + “i”, etc.
Constants and enumerations of MuPDF as implemented by PyMuPDF. Each of the following variables is accessible
as fitz.variable.
10.1 Constants
Base14_Fonts
Predefined Python list of valid PDF Base 14 Fonts.
Return type list
csRGB
Predefined RGB colorspace fitz.Colorspace(fitz.CS_RGB).
Return type Colorspace
csGRAY
Predefined GRAY colorspace fitz.Colorspace(fitz.CS_GRAY).
Return type Colorspace
csCMYK
Predefined CMYK colorspace fitz.Colorspace(fitz.CS_CMYK).
Return type Colorspace
CS_RGB
1 – Type of Colorspace is RGB
Return type int
CS_GRAY
2 – Type of Colorspace is GRAY
Return type int
CS_CMYK
3 – Type of Colorspace is CMYK
Return type int
VersionBind
‘x.xx.x’ – version of PyMuPDF (these bindings)
Return type string
VersionFitz
‘x.xxx’ – version of MuPDF
Return type string
VersionDate
ISO timestamp YYYY-MM-DD HH:MM:SS when these bindings were built.
Return type string
Note: The docstring of fitz contains information of the above which can be retrieved like so: print(fitz.__doc__), and
should look like: PyMuPDF 1.10.0: Python bindings for the MuPDF 1.10 library, built on 2016-11-30 13:09:13.
version
(VersionBind, VersionFitz, timestamp) – combined version information where timestamp is the generation point
in time formatted as “YYYYMMDDhhmmss”.
Return type tuple
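If the build time is needed as a date object, the compact timestamp format can be parsed with the standard library. A small sketch, using the build time quoted in the note above as sample input; the helper name is hypothetical.

```python
from datetime import datetime


def parse_build_timestamp(timestamp):
    """Convert a "YYYYMMDDhhmmss" string into a datetime object."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


built = parse_build_timestamp("20161130130913")
print(built.isoformat(sep=" "))
```

With PyMuPDF installed, the string to parse would be the third item of fitz.version.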
Code Meaning
PDF_ENCRYPT_KEEP do not change
PDF_ENCRYPT_NONE remove any encryption
PDF_ENCRYPT_RC4_40 RC4 40 bit
PDF_ENCRYPT_RC4_128 RC4 128 bit
PDF_ENCRYPT_AES_128 Advanced Encryption Standard 128 bit
PDF_ENCRYPT_AES_256 Advanced Encryption Standard 256 bit
PDF_ENCRYPT_UNKNOWN unknown
The table shows the file extensions you should use when extracting fonts from a PDF file.
Ext Description
ttf TrueType font
pfa Postscript for ASCII font (various subtypes)
cff Type1C font (compressed font equivalent to Type1)
cid character identifier font (postscript format)
otf OpenType font
n/a built-in font (PDF Base 14 Fonts or CJK: cannot be extracted)
Option bits controlling the amount of data that is parsed into a TextPage – this class is mainly used internally in
PyMuPDF.
For the PyMuPDF programmer, some combination (using Python’s | operator, or simply +) of these values is
aggregated in the flags integer, a parameter of all text search and text extraction methods. Depending on the
individual method, different default combinations of the values are used. Please use a value that meets your situation.
Especially make sure to switch off image extraction unless you really need it. The impact on performance and
memory is significant!
TEXT_PRESERVE_LIGATURES
1 – If set, ligatures are passed through to the application in their original form. Otherwise ligatures are expanded
into their constituent parts, e.g. the ligature “ffi” is expanded into three separate characters f, f and i. Default is
“on” in PyMuPDF. MuPDF supports the following 7 ligatures: “ff”, “fi”, “fl”, “ffi”, “ffl”, “ft”, “st”.
TEXT_PRESERVE_WHITESPACE
2 – If set, whitespace is passed through. Otherwise any type of horizontal whitespace (including horizontal tabs)
will be replaced with space characters of variable width. Default is “on” in PyMuPDF.
TEXT_PRESERVE_IMAGES
4 – If set, then images will be stored in the TextPage. This causes the presence of (usually large!) binary image
content in the output of text extractions of types “blocks”, “dict”, “json”, “rawdict”, “rawjson”, “html”, and
“xhtml” and is the default there. If used with “blocks” however, only image metadata will be returned, not the
image itself.
TEXT_INHIBIT_SPACES
8 – If set, MuPDF will not try to add missing space characters where there are large gaps between characters. In
PDF, the creator often does not insert spaces to point to the next character’s position, but will provide the direct
location address. The default in PyMuPDF is “off” – so spaces will be generated.
TEXT_DEHYPHENATE
16 – Ignore hyphens at line ends and join with next line. Used internally with the text search functions. However,
it is generally available: if on, text extractions will return joined text lines (or spans) with the ending hyphen
of the first line eliminated. So two separate spans “first meth-“ and “od leads to wrong results” on different
lines will be joined to one span “first method leads to wrong results” and correspondingly updated bboxes:
the characters of the resulting span will no longer have identical y-coordinates.
TEXT_PRESERVE_SPANS
32 – Generate a new line for every span. Not used (“off”) in PyMuPDF, but available for your use. Every line
in “dict”, “json”, “rawdict”, “rawjson” will contain exactly one span.
TEXT_MEDIABOX_CLIP
64 – If set, characters entirely outside a page’s mediabox will be ignored. This is the default in PyMuPDF.
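Combining and inspecting these option bits could look like the following sketch. It repeats the numeric values listed above in a plain dictionary so that it runs without PyMuPDF. The helper describe_flags is hypothetical.

```python
# Names and values as listed above.
TEXT_FLAGS = {
    "TEXT_PRESERVE_LIGATURES": 1,
    "TEXT_PRESERVE_WHITESPACE": 2,
    "TEXT_PRESERVE_IMAGES": 4,
    "TEXT_INHIBIT_SPACES": 8,
    "TEXT_DEHYPHENATE": 16,
    "TEXT_PRESERVE_SPANS": 32,
    "TEXT_MEDIABOX_CLIP": 64,
}


def describe_flags(flags):
    """List the names of all option bits that are set in flags."""
    return [name for name, bit in TEXT_FLAGS.items() if flags & bit]


flags = 1 | 2 | 64   # ligatures, whitespace, mediabox clipping
flags |= 16          # additionally switch on dehyphenation
flags &= ~4          # make sure image extraction is switched off
print(flags, describe_flags(flags))
```

In real code you would use the fitz.TEXT_* constants instead of the dictionary and pass the result as the flags parameter.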
Note: The rightmost byte of this integer is a bit field, so test the truth of these bits with the & operator.
LINK_FLAG_L_VALID
1 (bit 0) Top left x value is valid
Return type bool
LINK_FLAG_T_VALID
2 (bit 1) Top left y value is valid
Return type bool
LINK_FLAG_R_VALID
4 (bit 2) Bottom right x value is valid
Return type bool
LINK_FLAG_B_VALID
8 (bit 3) Bottom right y value is valid
Return type bool
LINK_FLAG_FIT_H
16 (bit 4) Horizontal fit
Return type bool
LINK_FLAG_FIT_V
32 (bit 5) Vertical fit
Return type bool
LINK_FLAG_R_IS_ZOOM
64 (bit 6) Bottom right x is a zoom figure
Return type bool
See chapter 8.4.5, pp. 615 of the Adobe PDF References for details.
These identifiers also cover links and widgets: the PDF specification technically handles them all in the same way,
whereas MuPDF (and PyMuPDF) treats them as three basically different types of objects.
PDF_ANNOT_TEXT 0
PDF_ANNOT_LINK 1 # <=== Link object in PyMuPDF
PDF_ANNOT_FREE_TEXT 2
PDF_ANNOT_LINE 3
PDF_ANNOT_SQUARE 4
PDF_ANNOT_CIRCLE 5
PDF_ANNOT_POLYGON 6
PDF_ANNOT_POLY_LINE 7
PDF_ANNOT_HIGHLIGHT 8
PDF_ANNOT_UNDERLINE 9
PDF_ANNOT_SQUIGGLY 10
PDF_ANNOT_STRIKE_OUT 11
PDF_ANNOT_REDACT 12
PDF_ANNOT_STAMP 13
PDF_ANNOT_CARET 14
PDF_ANNOT_INK 15
PDF_ANNOT_POPUP 16
PDF_ANNOT_FILE_ATTACHMENT 17
PDF_ANNOT_SOUND 18
PDF_ANNOT_MOVIE 19
PDF_ANNOT_RICH_MEDIA 20
PDF_ANNOT_WIDGET 21 # <=== Widget object in PyMuPDF
PDF_ANNOT_SCREEN 22
PDF_ANNOT_PRINTER_MARK 23
PDF_ANNOT_TRAP_NET 24
PDF_ANNOT_WATERMARK 25
PDF_ANNOT_LE_NONE 0
PDF_ANNOT_LE_SQUARE 1
PDF_ANNOT_LE_CIRCLE 2
PDF_ANNOT_LE_DIAMOND 3
PDF_ANNOT_LE_OPEN_ARROW 4
PDF_ANNOT_LE_CLOSED_ARROW 5
PDF_ANNOT_LE_BUTT 6
PDF_ANNOT_LE_R_OPEN_ARROW 7
PDF_ANNOT_LE_R_CLOSED_ARROW 8
PDF_ANNOT_LE_SLASH 9
PDF_WIDGET_TYPE_UNKNOWN 0
PDF_WIDGET_TYPE_BUTTON 1
PDF_WIDGET_TYPE_CHECKBOX 2
PDF_WIDGET_TYPE_COMBOBOX 3
PDF_WIDGET_TYPE_LISTBOX 4
PDF_WIDGET_TYPE_RADIOBUTTON 5
PDF_WIDGET_TYPE_SIGNATURE 6
PDF_WIDGET_TYPE_TEXT 7
PDF_WIDGET_TX_FORMAT_NONE 0
PDF_WIDGET_TX_FORMAT_NUMBER 1
PDF_WIDGET_TX_FORMAT_SPECIAL 2
PDF_WIDGET_TX_FORMAT_DATE 3
PDF_WIDGET_TX_FORMAT_TIME 4
Text widgets:
PDF_TX_FIELD_IS_MULTILINE 1 << 12
PDF_TX_FIELD_IS_PASSWORD 1 << 13
PDF_TX_FIELD_IS_FILE_SELECT 1 << 20
PDF_TX_FIELD_IS_DO_NOT_SPELL_CHECK 1 << 22
PDF_TX_FIELD_IS_DO_NOT_SCROLL 1 << 23
PDF_TX_FIELD_IS_COMB 1 << 24
PDF_TX_FIELD_IS_RICH_TEXT 1 << 25
Button widgets:
PDF_BTN_FIELD_IS_NO_TOGGLE_TO_OFF 1 << 14
PDF_BTN_FIELD_IS_RADIO 1 << 15
PDF_BTN_FIELD_IS_PUSHBUTTON 1 << 16
PDF_BTN_FIELD_IS_RADIOS_IN_UNISON 1 << 25
Choice widgets:
PDF_CH_FIELD_IS_COMBO 1 << 17
PDF_CH_FIELD_IS_EDIT 1 << 18
PDF_CH_FIELD_IS_SORT 1 << 19
PDF_CH_FIELD_IS_MULTI_SELECT 1 << 21
PDF_CH_FIELD_IS_DO_NOT_SPELL_CHECK 1 << 22
PDF_CH_FIELD_IS_COMMIT_ON_SEL_CHANGE 1 << 26
MuPDF has defined the following icons for rubber stamp annotations:
STAMP_Approved 0
STAMP_AsIs 1
STAMP_Confidential 2
STAMP_Departmental 3
STAMP_Experimental 4
STAMP_Expired 5
STAMP_Final 6
STAMP_ForComment 7
STAMP_ForPublicRelease 8
STAMP_NotApproved 9
STAMP_NotForPublicRelease 10
STAMP_Sold 11
STAMP_TopSecret 12
STAMP_Draft 13
Color Database
Since methods involving colors (like Page.draw_circle()) were introduced, access to predefined colors may be
required.
The fabulous GUI package wxPython has a database of over 540 predefined RGB colors, which are given more or less
memorizable names. Among them are not only standard names like “green” or “blue”, but also “turquoise”, “skyblue”,
and 100 (not only 50 . . . ) shades of “gray”, etc.
We have taken the liberty to copy this database (a list of tuples), in modified form, into PyMuPDF and make its colors
available as PDF compatible float triples: for wxPython’s (“WHITE”, 255, 255, 255) we return (1, 1, 1), which can be
directly used in color and fill parameters. Color names are matched without regard to case, so a mixed-case spelling
like “wHiTe” is also accepted.
As the color database may not be needed very often, one additional import statement seems acceptable to get access
to it:
If you want to actually see what the many available colors look like, use scripts colordbRGB.py or colordbHSV.py in
the examples directory. They create PDFs (already existing in the same directory) with all these colors. Their only
difference is sorting order: one takes the RGB values, the other one the Hue-Saturation-Values as sort criteria. This is
a screen print of what these files look like.
Appendix 1: Performance
TextPage is one of (Py-) MuPDF’s classes. It is normally created (and destroyed again) behind the curtain, when Page
text extraction methods are used, but it is also available directly and can be used as a persistent object. Contrary to
what its name suggests, images may optionally also be part of a text page:
<page>
<text block>
<line>
<span>
<char>
<image block>
<img>
Function TextPage.extractText() (or Page.get_text(“text”)) extracts a page’s plain text in original order as
specified by the creator of the document.
An example output:
>>> print(page.get_text("text"))
Some text on first page.
Note: The output may not match the “natural” reading order you are used to. However, you can request a reordering
following the scheme “top-left to bottom-right” by executing page.get_text("text", sort=True).
13.3 BLOCKS
Where the first 4 items are the float coordinates of the block’s bbox. The lines within each block are concatenated by
a new-line character.
This is a high-speed method, which by default also extracts image meta information: Each image appears as a block
with one text line, which contains meta information. The image itself is not shown.
As with simple text output above, the sort argument can be used as well to obtain a reading order.
Example output:
13.4 WORDS
Function TextPage.extractWORDS() (or Page.get_text(“words”)) extracts a page’s text words as a list of items
like:
Where the first 4 items are the float coordinates of the word’s bbox. The last three integers provide some more
information on the word’s whereabouts.
This is a high-speed method. As with the previous methods, argument sort=True will reorder the words.
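A short sketch of how such word items might be regrouped into lines. It assumes the item layout (x0, y0, x1, y1, word, block_no, line_no, word_no), i.e. that the word string sits at index 4 and the three trailing integers are block, line and word numbers. The sample data is made up.

```python
from itertools import groupby

# Hypothetical sample in the assumed item layout:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = [
    (90.0, 88.2, 110.0, 103.3, "on", 0, 0, 2),
    (50.0, 88.2, 72.0, 103.3, "Some", 0, 0, 0),
    (50.0, 120.0, 80.0, 135.0, "Second", 1, 0, 0),
    (75.0, 88.2, 88.0, 103.3, "text", 0, 0, 1),
]

# Sort by block, line and word number, then join the words of each line.
words.sort(key=lambda w: w[5:8])
lines = [
    " ".join(w[4] for w in group)
    for _, group in groupby(words, key=lambda w: w[5:7])
]
print(lines)
```

Sorting by the trailing integers keeps the creator's order; sorting by the bbox coordinates instead would approximate a top-left to bottom-right order.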
Example output:
13.5 HTML
TextPage.extractHTML() (or Page.get_text("html")) output fully reflects the structure of the page’s TextPage
– much like DICT / JSON below. This includes images, font information and text positions. If wrapped in HTML
header and trailer code, it can readily be displayed by an internet browser. Our above example:
>>> for line in page.get_text("html").splitlines():
...     print(line)
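Wrapping the extracted HTML in header and trailer code could look like the following sketch. The function name `wrap_html` and the chosen header lines are assumptions for this illustration; `body` stands in for the string returned by page.get_text("html").

```python
def wrap_html(body, title="page"):
    """Surround extracted page HTML with a minimal header and trailer."""
    header = (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{title}</title>\n</head>\n<body>\n"
    )
    trailer = "\n</body>\n</html>\n"
    return header + body + trailer


# Stand-in for the output of page.get_text("html")
body = '<div id="page0"><p>Some text on first page.</p></div>'
html = wrap_html(body)
print(html)
```

The result can be written to a file with an .html extension and opened in a browser.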
While HTML output has improved a lot in MuPDF v1.12.0, it is not yet bug-free: we have found problems in the areas
of font support and image positioning.
• HTML text contains references to the fonts used in the original document. If these are not known to the browser
(a fat chance!), it will replace them with others; the results will probably look awkward. This issue varies greatly
by browser – on my Windows machine, MS Edge worked just fine, whereas Firefox looked horrible.
• For PDFs with a complex structure, images may not be positioned and / or sized correctly. This seems to be the
case for rotated pages and for pages where the various possible page bbox variants do not coincide (e.g. MediaBox
!= CropBox). We do not yet know how to address this – we filed a bug at MuPDF’s site.
To address the font issue, you can use a simple utility script to scan through the HTML file and replace font references.
Here is a little example that replaces all fonts with one of the PDF Base 14 Fonts: serifed fonts will become “Times”,
non-serifed “Helvetica” and monospaced will become “Courier”. Their respective variations for “bold”, “italic”, etc.
are hopefully done correctly by your browser:
import sys

filename = sys.argv[1]
with open(filename) as f:
    otext = f.read()                          # original html text string
pos1 = 0                                      # search start position
font_serif = "font-family:Times"              # enter ...
font_sans = "font-family:Helvetica"           # ... your choices ...
font_mono = "font-family:Courier"             # ... here
found_one = False                             # True if a font spec was replaced

while True:
    pos0 = otext.find("font-family:", pos1)   # start of a font spec
    if pos0 < 0:                              # none found - we are done
        break
    pos1 = otext.find(";", pos0)              # end of font spec
    if pos1 < 0:                              # unterminated spec - stop here
        break
    test = otext[pos0:pos1]                   # complete font spec string
    testn = ""                                # the new font spec string
    if test.endswith(",serif"):               # font with serifs?
        testn = font_serif                    # use Times instead
    elif test.endswith(",sans-serif"):        # sans serifs font?
        testn = font_sans                     # use Helvetica
    elif test.endswith(",monospace"):         # monospaced font?
        testn = font_mono                     # becomes Courier

    if testn:                                 # any of the above found?
        otext = otext.replace(test, testn)    # change all its occurrences
        found_one = True
        pos1 = 0                              # start over

if found_one:
    with open(filename + ".html", "w") as ofile:
        ofile.write(otext)
else:
    print("Warning: could not find any font specs!")
TextPage.extractDICT() (or Page.get_text("dict", sort=False)) output fully reflects the structure of a TextPage
and provides image content and position detail (bbox – boundary boxes in pixel units) for every block, line and span.
Images are stored as bytes for DICT output and base64 encoded strings for JSON output.
For a visualization of the dictionary structure have a look at Structure of Dictionary Outputs.
Here is what this looks like:
{
"width": 300.0,
"height": 350.0,
"blocks": [{
"type": 0,
"bbox": (50.0, 88.17500305175781, 166.1709747314453, 103.28900146484375),
"lines": ({
"wmode": 0,
"dir": (1.0, 0.0),
"bbox": (50.0, 88.17500305175781, 166.1709747314453, 103.28900146484375),
"spans": ({
"size": 11.0,
"flags": 0,
"font": "Helvetica",
"color": 0,
"origin": (50.0, 100.0),
"text": "Some text on first page.",
"bbox": (50.0, 88.17500305175781, 166.1709747314453, 103.28900146484375)
"chars": [{
"origin": (50.0, 100.0),
"bbox": (50.0, 88.17500305175781, 57.336997985839844, 103.28900146484375),
"c": "S"
}, {
"origin": (57.33700180053711, 100.0),
"bbox": (57.33700180053711, 88.17500305175781, 63.4530029296875, 103.28900146484375),
"c": "o"
}, {
"origin": (63.4530029296875, 100.0),
"bbox": (63.4530029296875, 88.17500305175781, 72.61600494384766, 103.28900146484375),
"c": "m"
}, {
"origin": (72.61600494384766, 100.0),
"bbox": (72.61600494384766, 88.17500305175781, 78.73200225830078, 103.28900146484375),
"c": "e"
}, {
"origin": (78.73200225830078, 100.0),
"bbox": (78.73200225830078, 88.17500305175781, 81.79000091552734, 103.28900146484375),
"c": "."
}],
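To show how a dictionary of this shape can be walked, here is a minimal sketch. The sample `page_dict` is an abbreviated, hand-written copy of the structure (coordinates rounded); with PyMuPDF it would be the return value of page.get_text("dict"). Treating type 0 as "text block" is an assumption stated in the code.

```python
# Abbreviated sample in the block / line / span shape of the DICT output.
page_dict = {
    "width": 300.0,
    "height": 350.0,
    "blocks": [{
        "type": 0,
        "bbox": (50.0, 88.175, 166.171, 103.289),
        "lines": [{
            "wmode": 0,
            "dir": (1.0, 0.0),
            "bbox": (50.0, 88.175, 166.171, 103.289),
            "spans": [{
                "size": 11.0,
                "flags": 0,
                "font": "Helvetica",
                "color": 0,
                "origin": (50.0, 100.0),
                "text": "Some text on first page.",
                "bbox": (50.0, 88.175, 166.171, 103.289),
            }],
        }],
    }],
}

found = []  # (font, size, text) of every span on the page
for block in page_dict["blocks"]:
    if block["type"] != 0:  # assumption: type 0 marks a text block
        continue
    for line in block["lines"]:
        for span in line["spans"]:
            found.append((span["font"], span["size"], span["text"]))

for font, size, text in found:
    print(font, size, repr(text))
```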
13.9 XML
The TextPage.extractXML() (or Page.get_text(“xml”)) version extracts text (no images) with the detail level
of RAWDICT:
13.10 XHTML
<div id="page0">
<p>Some text on first page.</p>
</div>
(New in version 1.16.2) Method Page.get_text() supports a keyword parameter flags (int) to control the amount
and the quality of extracted data. The following table shows the default settings (flags parameter omitted or None)
for each extraction variant. If you specify flags with a value other than None, be aware that you must set all desired
options. A description of the respective bit settings can be found in Text Extraction Flags.
Indicator            text  html  xhtml  xml  dict  rawdict  words  blocks  search
preserve ligatures   1     1     1      1    1     1        1      1       1
preserve whitespace  1     1     1      1    1     1        1      1       1
preserve images      n/a   1     1      n/a  1     1        n/a    0       0
inhibit spaces       0     0     0      0    0     0        0      0       0
dehyphenate          0     0     0      0    0     0        0      0       1
clip to mediabox     1     1     1      1    1     1        1      1       1
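Because every desired option must be set explicitly once flags is given, it can help to start from the default of the chosen variant and change single bits. The sketch below uses local integer constants that repeat the bit values documented in Text Extraction Flags; the variable names `dict_default` and `my_flags` are made up for this example.

```python
# Bit values as documented in "Text Extraction Flags".
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_PRESERVE_IMAGES = 4
TEXT_INHIBIT_SPACES = 8
TEXT_DEHYPHENATE = 16
TEXT_PRESERVE_SPANS = 32
TEXT_MEDIABOX_CLIP = 64

# Default for "dict" according to the table above:
# ligatures, whitespace, images and mediabox clipping are on.
dict_default = (
    TEXT_PRESERVE_LIGATURES
    | TEXT_PRESERVE_WHITESPACE
    | TEXT_PRESERVE_IMAGES
    | TEXT_MEDIABOX_CLIP
)

# Same settings, but without images and with dehyphenation switched on.
my_flags = (dict_default & ~TEXT_PRESERVE_IMAGES) | TEXT_DEHYPHENATE

print(dict_default, my_flags)
print(bool(my_flags & TEXT_PRESERVE_IMAGES))  # images are now excluded
```

With PyMuPDF, a value built this way would be passed as page.get_text("dict", flags=my_flags).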
13.12 Performance
The text extraction methods differ significantly, both in terms of the information they supply and in terms of resource
requirements and runtimes. Generally, more information of course means that more processing is required and a
higher data volume is generated.
Note: Images in particular have a very significant impact. Make sure to exclude them (via the flags parameter)
whenever you do not need them. Processing the 2’700 total pages mentioned below with default flags settings required
160 seconds across all extraction methods. When all images were excluded, less than 50% of that time (77 seconds)
was needed.
To begin with, all methods are very fast in relation to other products out there in the market. In terms of processing
speed, we are not aware of a faster (free) tool. Even the most detailed method, RAWDICT, processes all 1’310 pages
of the Adobe PDF References in less than 5 seconds (simple text needs less than 2 seconds here).
The following table shows average relative speeds (“RSpeed”, baseline 1.00 is TEXT), taken across ca. 1400
text-heavy and 1300 image-heavy pages.
As mentioned, when image extraction is excluded (last column), the relative speeds change drastically: except for
RAWDICT and XML, the methods are almost equally fast, and RAWDICT requires 40% less execution time
than XML, which is now the slowest.
Look at chapter Appendix 1 for more performance information.
14.1 General
Starting with version 1.4, PDF supports embedding arbitrary files as part (“Embedded File Streams”) of a PDF
document file (see chapter “7.11.4 Embedded File Streams”, pp. 103 of the Adobe PDF References).
In many aspects, this is comparable to concepts also found in ZIP files or the OLE technique in MS Windows. PDF
embedded files do, however, not support directory structures as does the ZIP format. An embedded file can in turn
contain embedded files itself.
Advantages of this concept are that embedded files are under the PDF umbrella, benefitting from its permissions /
password protection and integrity aspects: all data, which a PDF may reference or even may be dependent on, can be
bundled into it and so form a single, consistent unit of information.
In addition to embedded files, PDF 1.7 adds collections to its support range. This is an advanced way of storing and
presenting meta information (i.e. arbitrary and extensible properties) of embedded files.
After adding initial support for collections (portfolios) and /EmbeddedFiles in MuPDF version 1.11, this support was
dropped again in version 1.15.
As a consequence, the cli utility mutool no longer offers access to embedded files.
PyMuPDF – having implemented an /EmbeddedFiles API in response in its version 1.11.0 – was therefore forced to
change gears starting with its version 1.16.0 (we never published a MuPDF v1.15.x compatible PyMuPDF).
We are now maintaining our own code basis supporting embedded files. This code makes use of basic MuPDF
dictionary and array functions only.
We continue to support the full old API with respect to embedded files – with only minor, cosmetic changes.
There is also a new function, Document.embfile_names(), which delivers a list of all names under which embedded
data are registered in a PDF.
This section deals with various technical topics, that are not necessarily related to each other.
Starting with version 1.18.11, the image transformation matrix is returned by some methods for text and image
extraction: Page.get_text() and Page.get_image_bbox().
The transformation matrix contains information about how an image was transformed to fit into the rectangle (its
“boundary box” = “bbox”) on some document page. By inspecting the image’s bbox on the page and this matrix, one
can determine for example, whether and how the image is displayed scaled or rotated on a page.
The relationship between image dimension and its bbox on a page is the following:
1. Using the original image’s width and height,
• define the image rectangle imgrect = fitz.Rect(0, 0, width, height)
• define the “shrink matrix” shrink = fitz.Matrix(1/width, 0, 0, 1/height, 0, 0).
2. Transforming the image rectangle with its shrink matrix will result in the unit rectangle:
imgrect * shrink = fitz.Rect(0, 0, 1, 1).
3. Using the image transformation matrix “transform”, the following steps will compute the bbox:
4. Inspecting the matrix product shrink * transform will reveal all information about what happened to the
image rectangle to make it fit into the bbox on the page: rotation, scaling of its sides and translation of its origin.
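The steps above can be sketched with plain tuples instead of fitz objects. The helpers `mat_mul` and `transform_rect` are stand-ins written for this illustration (assuming the usual PDF convention that a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f)), and the image size and transform values are invented. In PyMuPDF itself, the corresponding expression would presumably be imgrect * shrink * transform.

```python
def mat_mul(m1, m2):
    """Concatenate two PDF-style matrices (a, b, c, d, e, f): m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def transform_rect(rect, m):
    """Transform all four corners and return the enclosing rectangle."""
    x0, y0, x1, y1 = rect
    a, b, c, d, e, f = m
    pts = [(x * a + y * c + e, x * b + y * d + f)
           for x, y in ((x0, y0), (x1, y0), (x0, y1), (x1, y1))]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))


width, height = 512, 256                  # hypothetical image size
transform = (256, 0, 0, 128, 50, 60)      # hypothetical image matrix
imgrect = (0, 0, width, height)
shrink = (1 / width, 0, 0, 1 / height, 0, 0)

combined = mat_mul(shrink, transform)     # what happened to the image rectangle
bbox = transform_rect(imgrect, combined)  # where it ends up on the page
print(combined)
print(bbox)
```

In this made-up case the combined matrix shows a uniform scale factor of 0.5 and a translation of the origin to (50, 60), with no rotation.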
Let us look at an example:
>>> #------------------------------------------------
>>> # the above shows:
>>> # image sides are scaled by same factor ~0.4,
>>> # and the image is rotated by 90 degrees clockwise
>>> # compare this with fitz.Matrix(-90) * 0.4
>>> #------------------------------------------------
The following 14 builtin font names must be supported by every PDF viewer application. They are available as a
dictionary, which maps their full names and their abbreviations in lower case to the full font basename. Wherever a
fontname must be provided in PyMuPDF, any key or value from the dictionary may be used:
In [2]: fitz.Base14_fontdict
Out[2]:
{'courier': 'Courier',
'courier-oblique': 'Courier-Oblique',
'courier-bold': 'Courier-Bold',
'courier-boldoblique': 'Courier-BoldOblique',
'helvetica': 'Helvetica',
'helvetica-oblique': 'Helvetica-Oblique',
'helvetica-bold': 'Helvetica-Bold',
'helvetica-boldoblique': 'Helvetica-BoldOblique',
'times-roman': 'Times-Roman',
'times-italic': 'Times-Italic',
'times-bold': 'Times-Bold',
'times-bolditalic': 'Times-BoldItalic',
'symbol': 'Symbol',
'zapfdingbats': 'ZapfDingbats',
'helv': 'Helvetica',
Despite this obligation, not all PDF viewers support these fonts correctly and completely – this is especially
true for Symbol and ZapfDingbats. Also, the glyph (visual) images will be specific to every reader.
To see how these fonts can be used – including the CJK built-in fonts – look at the table in Page.insert_font().
This PDF Reference manual published by Adobe is frequently quoted throughout this documentation. It can be viewed
and downloaded from here.
Note: For a long time, an older version was also available under this link. It seems to have been taken off the web
site in October 2021. Earlier (pre 1.19.*) versions of the PyMuPDF documentation used to refer to that document. We
have undertaken an effort to replace those references with references to the current specification above.
When PyMuPDF objects and methods require a Python list of numerical values, other Python sequence types are also
allowed. Python classes are said to implement the sequence protocol, if they have a __getitem__() method.
This basically means, you can interchangeably use Python list or tuple or even array.array, numpy.array and bytearray
types in these cases.
For example, specifying a sequence "s" in any of the following ways
• s = [1, 2] – a list
• s = (1, 2) – a tuple
• s = array.array("i", (1, 2)) – an array.array
• s = numpy.array((1, 2)) – a numpy array
• s = bytearray((1, 2)) – a bytearray
will make it usable in the following example expressions:
• fitz.Point(s)
• fitz.Point(x, y) + s
• doc.select(s)
Similarly with all geometry objects Rect, IRect, Matrix and Point.
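As a minimal sketch of what "implementing the sequence protocol" means, the hypothetical class `Pair` below offers only __len__() and __getitem__(). Standard Python functions then treat it like a list; whether every PyMuPDF method accepts such a custom class exactly like [1, 2] is an assumption that is not verified here.

```python
class Pair:
    """Minimal sequence: supports len() and integer indexing only."""

    def __init__(self, x, y):
        self._items = (float(x), float(y))

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        # IndexError past the end is what stops iteration over the object
        return self._items[i]


s = Pair(1, 2)
print(list(s), tuple(s), len(s))
x, y = s  # unpacking works as well
print(x + y)
```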
Because all PyMuPDF geometry classes themselves are special cases of sequences, they (with the exception of Quad
– see below) can be freely used where numerical sequences can be used, e.g. as arguments for functions like list(),
tuple(), array.array() or numpy.array(). Look at the following snippet to see this work.
>>> import fitz, array, numpy as np
>>> m = fitz.Matrix(1, 2, 3, 4, 5, 6)
>>>
>>> list(m)
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
>>>
>>> tuple(m)
(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
>>>
>>> array.array("f", m)
array('f', [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
>>>
>>> np.array(m)
array([1., 2., 3., 4., 5., 6.])
Note: Quad is a Python sequence object as well and has a length of 4. Its items however are point_like – not
numbers. Therefore, the above remarks do not apply.
PyMuPDF is a Python binding for the C library MuPDF. While a lot of effort has been invested by MuPDF’s creators
to approximate some sort of an object-oriented behavior, they certainly could not overcome basic shortcomings of the
C language in that respect.
Python on the other hand implements the OO-model in a very clean way. The interface code between PyMuPDF and
MuPDF consists of two basic files: fitz.py and fitz_wrap.c. They are created by the excellent SWIG tool for each new
version.
When you use one of PyMuPDF’s objects or methods, this will result in execution of some code in fitz.py, which in turn
will call some C code compiled with fitz_wrap.c.
Because SWIG goes a long way to keep the Python and the C level in sync, everything works fine, if a certain set of
rules is being strictly followed. For example: never access a Page object, after you have closed (or deleted or set to
None) the owning Document. Or, less obvious: never access a page or any of its children (links or annotations) after
you have executed one of the document methods select(), delete_page(), insert_page() . . . and more.
But just no longer accessing invalidated objects is actually not enough: They should rather be actively deleted entirely,
to also free C-level resources (meaning allocated memory).
The reason for these rules lies in the fact that there is a hierarchical 2-level one-to-many relationship between a
document and its pages and also between a page and its links / annotations. To maintain a consistent situation, any of
the above actions must lead to a complete reset – in Python and, synchronously, in C.
SWIG cannot know about this and consequently does not do it.
The required logic has therefore been built into PyMuPDF itself in the following way.
1. If a page “loses” its owning document or is being deleted itself, all of its currently existing annotations and links
will be made unusable in Python, and their C-level counterparts will be deleted and deallocated.
2. If a document is closed (or deleted or set to None) or if its structure has changed, then similarly all currently
existing pages and their children will be made unusable, and corresponding C-level deletions will take place.
“Structure changes” include methods like select(), delete_page(), insert_page(), insert_pdf() and so on: all of these
will result in a cascade of object deletions.
The programmer will normally not notice any of this. However, any attempt to access invalidated objects will raise
an exception.
Invalidated objects cannot be directly deleted as with Python statements like del page or page = None, etc. Instead,
their __del__ method must be invoked.
All pages, links and annotations have the property parent, which points to the owning object. This is the property that
can be checked on the application level: if obj.parent == None then the object’s parent is gone, and any reference to
its properties or methods will raise an exception informing about this “orphaned” state.
A sample session:
Note: Objects outside the above relationship are not included in this mechanism. If you e.g. created a table of
contents by toc = doc.get_toc(), and later close or change the document, then this cannot and does not change variable
toc in any way. It is your responsibility to refresh such variables as required.
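The ownership rules of this section can be pictured with a toy model in plain Python. The classes `ToyDocument` and `ToyPage` are invented for this illustration and do not show how PyMuPDF implements the mechanism (the C-level cleanup in particular is not modelled); they only mirror the observable behaviour: after a structure change, the page's parent is None and using the page raises an exception.

```python
class ToyPage:
    def __init__(self, parent, number):
        self.parent = parent  # owning document, None once orphaned
        self.number = number

    def text(self):
        if self.parent is None:
            raise ValueError("orphaned object: parent is None")
        return f"page {self.number} of {self.parent.name}"


class ToyDocument:
    def __init__(self, name, page_count):
        self.name = name
        self.page_count = page_count
        self._pages = []  # pages handed out so far

    def load_page(self, number):
        page = ToyPage(self, number)
        self._pages.append(page)
        return page

    def _reset_pages(self):
        """Orphan every page that was handed out (the 'cascade')."""
        for page in self._pages:
            page.parent = None
        self._pages.clear()

    def delete_page(self, number):
        self.page_count -= 1
        self._reset_pages()  # structure changed: invalidate all pages

    def close(self):
        self._reset_pages()


doc = ToyDocument("some.pdf", 3)
page = doc.load_page(0)
print(page.text())
doc.delete_page(2)  # a structure change, even though page 0 is untouched
print(page.parent)
try:
    page.text()
except ValueError as exc:
    print("error:", exc)
```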
The method displays an image of a (“source”) page of another PDF document within a specified rectangle of the
current (“containing”, “target”) page.
• In contrast to Page.insert_image(), this display is vector-based and hence remains accurate across
zooming levels.
• Just like Page.insert_image(), the size of the display is adjusted to the given rectangle.
The following variations of the display are currently supported:
• Bool parameter keep_proportion controls whether to maintain the aspect ratio (default) or not.
• Rectangle parameter clip restricts the visible part of the source page rectangle. Default is the full page.
• float rotation rotates the display by an arbitrary angle (degrees). If the angle is not an integer multiple of 90,
only 2 of the 4 corners may be positioned on the target border if also keep_proportion is true.
• Bool parameter overlay controls whether to put the image on top (foreground, default) of current page content
or not (background).
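How keep_proportion affects the placement can be sketched as a small geometry computation. The function `fit_rect` is a hypothetical helper in plain Python; it ignores rotation and PDF's bottom-up coordinate system, and centering the scaled source inside the target is an assumption of this sketch rather than a documented property of the method.

```python
def fit_rect(src, target, keep_proportion=True):
    """Return (matrix, rect) mapping rectangle *src* into *target*.

    Rectangles are (x0, y0, x1, y1); the matrix is PDF-style (a, b, c, d, e, f).
    """
    sw, sh = src[2] - src[0], src[3] - src[1]
    tw, th = target[2] - target[0], target[3] - target[1]
    fw, fh = tw / sw, th / sh
    if keep_proportion:
        fw = fh = min(fw, fh)  # same scale factor for both sides
    # center the scaled source inside the target (assumption of this sketch)
    e = target[0] + (tw - sw * fw) / 2 - src[0] * fw
    f = target[1] + (th - sh * fh) / 2 - src[1] * fh
    matrix = (fw, 0, 0, fh, e, f)
    rect = (src[0] * fw + e, src[1] * fh + f, src[2] * fw + e, src[3] * fh + f)
    return matrix, rect


matrix, rect = fit_rect((0, 0, 600, 800), (100, 100, 300, 300))
print(matrix)
print(rect)
```

With keep_proportion switched off, each side gets its own scale factor and the source fills the target rectangle completely.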
Use cases include (but are not limited to) the following:
1. “Stamp” a series of pages of the current document with the same image, like a company logo or a watermark.
2. Combine arbitrary input pages into one output page to support “booklet” or double-sided printing (known as
“4-up”, “n-up”).
3. Split up (large) input pages into several arbitrary pieces. This is also called “posterization”, because you e.g.
can split an A4 page horizontally and vertically, print the 4 pieces enlarged to separate A4 pages, and end up
with an A2 version of your original page.
This is done using PDF “Form XObjects”, see section 8.10 on page 217 of Adobe PDF References. On execution of
a Page.show_pdf_page(rect, src, pno, . . . ), the following things happen:
1. The resources and contents objects of page pno in document src are copied over to the current document,
jointly creating a new Form XObject with the following properties. The PDF xref number of this object is
returned by the method.
a. /BBox equals /Mediabox of the source page
b. /Matrix equals the identity matrix [1 0 0 1 0 0]
c. /Resources equals that of the source page. This involves a “deep-copy” of hierarchically
nested other objects (including fonts, images, etc.). The complexity involved here is covered
by MuPDF’s grafting [1] technique functions.
d. This is a stream object type, and its stream is an exact copy of the combined data of the source
page’s /Contents objects.
This step is only executed once per shown source page. Subsequent displays of the same page only
create pointers (done in next step) to this object.
2. A second Form XObject is then created which the target page uses to invoke the display. This object has the
following properties:
a. /BBox equals the /CropBox of the source page (or clip).
b. /Matrix represents the mapping of /BBox to the target rectangle.
c. /XObject references the previous XObject via the fixed name fullpage.
d. The stream of this object contains exactly one fixed statement: /fullpage Do.
3. The resources and contents objects of the target page are now modified as follows.
a. Add an entry to the /XObject dictionary of /Resources with the name fzFrm<n> (with n chosen such that
this entry is unique on the page).
b. Depending on overlay, prepend or append a new object to the page’s /Contents array, containing the
statement q /fzFrm<n> Do Q.
[1] MuPDF supports “deep-copying” objects between PDF documents. To avoid duplicate data in the target, it uses so-called “graftmaps”, like
a form of scratchpad: for each object to be copied, its xref number is looked up in the graftmap. If found, copying is skipped. Otherwise, the
new xref is recorded and the copy takes place. PyMuPDF makes use of this technique in two places so far: Document.insert_pdf() and
Page.show_pdf_page(). This process is fast and very efficient, because it prevents multiple copies of typically large and frequently referenced
data, like images and fonts. However, you may still want to consider using garbage collection (option 4) in any of the following cases:
1. The target PDF is not new / empty: grafting does not check for resources that already existed (e.g. images, fonts) in the target document
before opening it.
2. Using Page.show_pdf_page() for more than one source document: each grafting occurs within one source PDF only, not across
multiple. So if e.g. the same image exists in pages from different source PDFs, then this will not be detected until garbage collection.
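The graftmap idea described in the footnote (look up the source xref, skip the copy if it is already recorded) can be modelled with a dictionary. The class `GraftMap` below is a toy written for this illustration; object stores are plain dicts keyed by xref number, and nothing here reflects MuPDF's real data structures.

```python
class GraftMap:
    """Toy model of a graft map: source xref -> xref of the copy in the target."""

    def __init__(self):
        self._map = {}
        self.copies_made = 0

    def graft(self, src_objects, src_xref, target_objects):
        """Copy object *src_xref* into *target_objects* unless already copied."""
        if src_xref in self._map:  # seen before: reuse the existing copy
            return self._map[src_xref]
        new_xref = len(target_objects) + 1  # next free number in the target
        target_objects[new_xref] = src_objects[src_xref]
        self._map[src_xref] = new_xref
        self.copies_made += 1
        return new_xref


source = {7: "<image data>", 9: "<font data>"}  # hypothetical source objects
target = {}
gmap = GraftMap()
first = gmap.graft(source, 7, target)
second = gmap.graft(source, 7, target)  # same object requested again
font = gmap.graft(source, 9, target)
print(first, second, font, gmap.copies_made)
```

A separate map per source document also illustrates the second caveat of the footnote: identical data coming from two different sources would be copied twice.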
Since MuPDF version 1.16, error and warning messages can be redirected via an official plugin.
PyMuPDF will put error messages to sys.stderr prefixed with the string “mupdf:”. Warnings are internally stored and
can be accessed via fitz.TOOLS.mupdf_warnings(). There also is a function to empty this store.
Change Log
• Fixed #1375. Inconsistencies between line numbers as returned by the “words” and the “dict” options of
Page.get_text() have been corrected.
• Fixed #1364. The check for being a "rawdict" span in recover_span_quad() now works correctly.
• Fixed #1342. Corrected the check for rectangle infiniteness in Page.show_pdf_page().
• Changed Page.get_drawings(), Page.get_cdrawings() to return an indicator on the area
orientation covered by a rectangle. This implements #1355. Also, the recognition rate for rectangles and quads has
been significantly improved.
• Changed all text search and extraction methods to set the new flags option TEXT_MEDIABOX_CLIP to ON
by default. That bit causes the automatic suppression of all characters that are completely outside a page’s
mediabox (in as far as that notion is supported for a document type). This eliminates the need for using
clip=page.rect or similar for omitting text outside the visible area.
• Added parameter "dpi" to Page.get_pixmap() and Annot.get_pixmap(). When given, parameter
"matrix" is ignored, and a Pixmap with the desired dots per inch is created.
• Added attributes Pixmap.is_monochrome and Pixmap.is_unicolor allowing fast checks of pixmap
properties. Addresses #1397.
• Added method Pixmap.color_count() to determine the unique colors in the pixmap.
• Added boolean parameter "compress" to PDF document method Document.update_stream().
Addresses / enables solution for #1408.
A new MuPDF feature is journalling PDF updates, which is also supported by this PyMuPDF version. Changes may
be logged, rolled back or replayed, making it possible to implement a whole new level of control over PDF document
integrity – similar to functions present in modern database systems.
A third feature (unrelated to the new MuPDF version) includes the ability to detect when page objects cover or hide
each other. It is now e.g. possible to see that text is covered by a drawing or an image.
• Changed terminology and meaning of important geometry concepts: Rectangles are now characterized as finite,
valid or empty, while the definitions of these terms have also changed. Rectangles specifically are now thought
of as being “open”: not all corners and sides are considered part of the rectangle. Please do read the Rect section
for details.
• Added new parameter “no_new_id” to Document.save() / Document.tobytes() methods. Use it to
suppress updating the second item of the document /ID which in PDF indicates that the original file has been
updated. If the PDF has no /ID at all yet, then no new one will be created either.
• Added a journalling facility for PDF updates. This allows logging changes, undoing or redoing them, or saving
the journal for later use. Refer to Document.journal_enable() and friends.
• Added new Pixmap methods Pixmap.pdfocr_save() and Pixmap.pdfocr_tobytes(), which
generate a 1-page PDF containing the pixmap as PNG image with OCR text layer.
• Added Page.get_textpage_ocr() which executes optical character recognition for the page, then
extracts the results and stores them together with “normal” page content in a TextPage. Use or reuse this object
in subsequent text extractions and text searches to avoid multiple efforts. The existing text search and text
extraction methods have been extended to support a separately created textpage – see next item.
• Added a new parameter textpage to text extraction and text search methods. This allows reuse of a previously
created TextPage and thus achieves significant runtime benefits – which is especially important for the new OCR
features. But “normal” text extractions can definitely also benefit.
• Added Page.get_texttrace(), a technical method delivering low-level text character properties. It was
present before as a private method, but the author felt it now is mature enough to be officially available. It
specifically includes a “sequence number” which indicates the page appearance build operation that painted the
text.
• Added Page.get_bboxlog() which delivers the list of rectangles of page objects like text, images or
drawings. Its significance lies in its sequence: rectangles intersecting areas with a lower index are covering or
hiding them.
• Changed methods Page.get_drawings() and Page.get_cdrawings() to include a “sequence
number” indicating the page appearance build operation that created the drawing.
• Fixed #1311. Field values in comboboxes should now be handled correctly.
• Fixed #1290. Error was caused by incorrect rectangle emptiness check, which is fixed due to new geometry
logic of this version.
• Fixed #1286. Text alignment for redact annotations is working again.
• Fixed #1287. Infinite loop issue for non-Windows systems when applying some redactions has been resolved.
• Fixed #1284. Text layout destruction after applying redactions in some cases has been resolved.
• Fixed issue #1244. Now correctly computing the transform matrix in Page.get_image__bbox().
• Fixed issue #1241. Prevent returning artifact characters in Page.get_textbox(), which happened in cer-
tain constellations.
• Fixed issue #1234. Avoid creating infinite rectangles in corner cases – Page.get_drawings(), Page.
get_cdrawings().
• Added test data and test scripts to the PyPI source distribution.
• Changed Document.subset_fonts() to now correctly prefix font subsets with an appropriate six-letter
uppercase tag, complying with the PDF specification.
• Added new method Widget.button_states() which returns the possible values that a button-type field
can have when being set to “on” or “off”.
• Added support of text with Small Capital letters to the Font and TextWriter classes. This is reflected by an
additional bool parameter small_caps in several of their methods.
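The subset naming rule mentioned for Document.subset_fonts() (a six-letter uppercase tag, a plus sign, then the font name) can be illustrated with a short sketch. The helper `subset_font_name` is hypothetical and is not PyMuPDF's internal code; how PyMuPDF actually chooses the tag is not stated here.

```python
import random
import string


def subset_font_name(basefont, rng=None):
    """Return basefont prefixed with a six-letter uppercase tag and '+'.

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    tag = "".join(rng.choice(string.ascii_uppercase) for _ in range(6))
    return f"{tag}+{basefont}"


name = subset_font_name("Helvetica", random.Random(0))
tag, _, base = name.partition("+")
print(name)  # six uppercase letters, '+', then the original font name
```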
• Changed TextWriter output to also accept text in right-to-left mode (Arabic, Hebrew):
TextWriter.fill_textbox(), TextWriter.append(). These methods now accept a new boolean parameter
right_to_left, which is False by default. Implements #897.
• Changed TextWriter.fill_textbox() to return all lines of text that did not fit in the given rectangle.
Also changed the default of the warn parameter to no longer print a warning message in overflow situations.
• Added a utility function recover_quad(), which computes the quadrilateral of a span. This function can
be used for correctly marking text extracted with the “dict” or “rawdict” options of Page.get_text().
Note: This version introduces Python type hinting. The goal is to provide each parameter and the return value of
all functions and methods with type information. This is still work in progress, although the majority of functions have
already been handled.
• Changed Page.getImageBbox() to also compute the bbox if the image is contained in an XObject.
• Changed Shape.insertTextbox(), Page.insertTextbox() and TextWriter.fillTextbox()
to respect the font’s properties “ascender” / “descender” when computing line height and
insertion point. This should no longer lead to line overlaps for multi-line output. These methods used to ignore
font specifics and used constant values instead.
– Unsuccessful storage allocations should now always lead to exceptions (circumvention of an upstream
bug intermittently crashing the interpreter).
– Pixmap size is now based on size_t instead of int in C and should be correct even for extremely large
pixmaps.
• Fixed issue #668. Specification of dashes for PDF drawing insertion should now correctly reflect the PDF spec.
• Fixed issue #669. A major source of memory leakage in Document.insert_pdf() has been removed.
• Added keyword “images” to Page.apply_redactions() for fine-controlling the handling of images.
• Added Annot.getText() and Annot.getTextbox(), which offer the same functionality as the Page
versions.
• Added key “number” to the block dictionaries of Page.getText() / Annot.getText() for options
“dict” and “rawdict”.
• Added glyph_name_to_unicode() and unicode_to_glyph_name(). Both functions do not really
connect to a specific font and are now independently available, too. The data are now based on the Adobe Glyph
List.
• Added convenience functions adobe_glyph_names() and adobe_glyph_unicodes() which return
the respective available data.
• Added Page.getDrawings() which returns details of drawing operations on a document page. Works for
all document types.
• Improved performance of Document.insert_pdf(). Multiple object copies are now also suppressed
across multiple separate insertions from the same source. This saves time, memory and target file size.
Previously, this mechanism was only active within each single method execution. The feature can also be suppressed
with the new bool parameter final=1, which is the default.
• For PNG images created from pixmaps, the resolution (dpi) is now automatically set from the respective
Pixmap.xres and Pixmap.yres values.
• Fixed an undocumented issue, which prevented fully cleaning a PDF page when using
Page.cleanContents().
• Fixed issue #540. Text extraction for EPUB should again work correctly.
• Fixed issue #548. Documentation now includes LINK_NAMED.
• Added new parameter to control start of text in TextWriter.fillTextbox(). Implements #549.
• Changed documentation of Page.add_redact_annot() to explain the usage of non-builtin fonts.
• Added Page.load_annot() which loads an annotation given its unique id (/NM key).
• Added Document.reload_page() which provides a new copy of a page after finishing any pending
updates to it.
• Fixed issue #354 (“SyntaxWarning with Python 3.8”). We now always use “==” for literals (instead of the “is”
Python keyword).
• Fixed issue #353 (“mupdf version check”), to no longer refuse the import when there are only patch level
deviations from MuPDF.
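The reason behind the fix for #354 can be reproduced with plain Python, independent of PyMuPDF. The sketch below compiles an identity test against a string literal to show the SyntaxWarning issued by Python 3.8 and later, then uses the value comparison instead. The variable names are illustrative only.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Identity test against a literal: compiles, but triggers a SyntaxWarning.
    compile('option is "dict"', "<sketch>", "eval")

option = "".join(["di", "ct"])  # equal to "dict", but built at run time
uses_dict = option == "dict"    # value comparison, the form now used
print([str(w.message) for w in caught], uses_dict)
```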
• Fixed issues #301 (“Line cap and Line join”), #300 (“How to draw a shape without outlines”) and #298
(“utils.updateRect exception”). These bugs pertain to drawing shapes with PyMuPDF. Drawing shapes without
any border is fully supported. Line cap and line join styles are now differentiated and support all
possible PDF values (0, 1, 2) instead of just being a bool. The previous parameter roundCap is deprecated in
favor of lineCap and lineJoin and will be deleted in the next release.
• Fixed issue #290 (“Memory Leak with getText(‘rawDICT’)”). This bug caused memory not to be (completely)
freed after invoking the “dict”, “rawdict” and “json” versions of Page.getText().
• Fixed a bug in Page.insertImage() which prevented insertion of multiple images provided as streams.
• Changed: output of methods Pixmap.save() and (the new) Pixmap.tobytes() may now also be PSD
(Adobe Photoshop Document).
• Added method Shape.drawQuad() which draws a Quad. This actually is a shorthand for a
Shape.drawPolyline() with the edges of the quad.
• Changed method Shape.drawOval(): the argument can now be either a rectangle (rect_like) or a
quadrilateral (quad_like).
• Changed text searching, Page.searchFor(), to optionally return Quad instead of Rect objects surrounding
each search hit.
• Changed plain text output: we now add a \n to each line if it does not itself end with this character.
• Fixed issue #211 (“Something wrong in the doc”).
• Fixed issue #213 (“Rewritten outline is displayed only by mupdf-based applications”).
• Fixed issue #214 (“PDF decryption GONE!”).
• Fixed issue #215 (“Formatting of links added with pyMuPDF”).
• Fixed issue #217 (“extraction through json is failing for my pdf”).
Behind the curtain, we have changed the implementation of geometry objects: they now purely exist in Python and
no longer have “shadow” twins on the C-level (in MuPDF). This has improved processing speed in that area by more
than a factor of two.
For the same reason, most methods involving geometry parameters now also accept the corresponding Python
sequence. For example, in method “page.show_pdf_page(rect, . . . )” parameter rect may now be any rect_like
sequence.
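What “rect_like” means in practice can be sketched in plain Python: any sequence of four numbers is accepted wherever a rectangle is expected. The helper `as_rect_tuple` below is hypothetical and only models the idea; it is not how PyMuPDF validates its arguments.

```python
def as_rect_tuple(rect_like):
    """Normalize a rect_like value to a tuple of four floats (x0, y0, x1, y1).

    Hypothetical helper; accepts lists, tuples or any iterable of 4 numbers.
    """
    values = tuple(float(v) for v in rect_like)
    if len(values) != 4:
        raise ValueError("rect_like must contain exactly 4 numbers")
    return values


print(as_rect_tuple([0, 0, 612, 792]))  # a list works
print(as_rect_tuple((36, 36, 300, 400)))  # so does a tuple
```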
We also invested considerable effort to further extend and improve the Collection of Recipes chapter.
• Added method Document._deleteObject() which deletes a PDF object identified by its xref. Only to
be used by the experienced PDF expert.
• Added a method paper_rect() which returns a Rect for a supplied paper format string. Example:
fitz.paper_rect(“letter”) = fitz.Rect(0.0, 0.0, 612.0, 792.0).
• Added a Collection of Recipes chapter to this document.
• Changed embedded file methods to now also accept or show the PDF unicode filename as additional parameter
ufilename.
• Added Page.add_file_annot() which adds a new file attachment annotation.
• Changed Annot.fileUpd() (file attachment annot) to now also accept the PDF unicode ufilename
parameter. The description parameter desc correctly works with unicode. Furthermore, all parameters are optional, so
metadata may be changed without also replacing the file content.
• Changed Annot.fileInfo() (file attachment annot) to now also show the PDF unicode filename as
parameter ufilename.
• Fixed issue #180 (“page.getText(output=’dict’) return invalid bbox”) to now also work for vertical text.
• Fixed issue #185 (“Can’t render the annotations created by PyMuPDF”). The issue’s cause was the
minimalistic MuPDF approach when creating annotations. Several annotation types have no /AP (“appearance”) object
when created by MuPDF functions. MuPDF, SumatraPDF and hence also PyMuPDF cannot render annotations
without such an object. This fix now ensures that an appearance object is always created together with the
annotation itself. We still do not support line end styles.
• Added Document.isDirty which is True if a PDF has been changed in this session. Reset to False on each
Document.save() or Document.write().
• Page.getText() correspondingly supports the new parameter value “dict” to invoke the above method.
• TextPage.extractJSON() (resp. Page.getText(“json”)) is still supported for convenience, but its use is
expected to decline.
• To support finding text positions, we have added special methods that don’t need detours like
TextPage.extractJSON() or TextPage.extractXML(): use Page.getTextBlocks() or
Page.getTextWords() to create lists of text blocks or words, respectively, which are accompanied by their rectangles.
This should be much faster than the standard text extraction methods and also avoids using additional packages
for interpreting their output.
• The Document class now supports embedded files with several new methods and one new property:
– embfile_Info() returns metadata information about an entry in the list of embedded files. This is more than
mutool currently provides: it shows all the information that was used to embed the file (not just the entry’s
name).
– embfile_Get() retrieves the (decompressed) content of an entry into a bytes buffer.
– embfile_Add(. . . ) inserts new content into the PDF portfolio. We (in contrast to mutool) restrict this to
entries with a new name (no duplicate names allowed).
– embfile_Del(. . . ) deletes an entry from the portfolio (function not offered in MuPDF).
– embfile_SetInfo() – changes filename or description of an embedded file.
– embfile_Count – contains the number of embedded files.
• Several enhancements deal with streamlining geometry objects. These are not connected to the new MuPDF
version and most of them are also reflected in PyMuPDF v1.10.0. Among them are new properties to identify
the corners of rectangles by name (e.g. Rect.bottom_right) and new methods to deal with set-theoretic questions
like Rect.contains(x) or IRect.intersects(x). Special effort focussed on supporting more “Pythonic” language
constructs: if x in rect . . . is equivalent to rect.contains(x).
• The Rect chapter now has more background on empty and infinite rectangles and how we handle them. The
handling itself was also updated for more consistency in this area.
• We have started basic support for generation of PDF content:
– Document.insert_page() adds a new page into a PDF, optionally containing some text.
– Page.insertImage() places a new image on a PDF page.
– Page.insertText() puts new text on an existing page
• For FileAttachment annotations, content and name of the attached file can be extracted and changed.
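The equivalence between `x in rect` and `rect.contains(x)` mentioned in this list follows from Python's `__contains__` protocol. The toy class below is a sketch of that mechanism only; it is not the PyMuPDF Rect class, and treating border points as contained is an assumption of the sketch.

```python
class SketchRect:
    """Toy rectangle used only to illustrate `x in rect`."""

    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1

    @property
    def bottom_right(self):
        # Named corner, analogous in spirit to Rect.bottom_right.
        return (self.x1, self.y1)

    def contains(self, point):
        x, y = point
        # Assumption of this sketch: points on the border count as inside.
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

    def __contains__(self, point):
        # Makes `point in rect` call rect.contains(point).
        return self.contains(point)


r = SketchRect(0, 0, 10, 20)
print((5, 5) in r, r.contains((5, 5)), (11, 5) in r)
```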
code change cannot be avoided in these cases. We assume however, that not many users are actually employing
these rather low level classes explicitly. So the impact of that change should be minor.
Other Changes compared to Version 1.9.3
• The new Document method write() writes an opened PDF to memory (as opposed to a file, like save() does).
• An annotation can now be scaled and moved around on its page. This is done by modifying its rectangle.
• Annotations can now be deleted. Page contains the new method deleteAnnot().
• Various annotation attributes can now be modified, e.g. content, dates, title (= author), border, colors.
• Method Document.insert_pdf() now also copies annotations of source pages.
• The Pages class has been deleted. As documents can now be accessed with page numbers as indices (like
doc[n] = doc.loadPage(n)), and document objects can be used as iterators, the benefit of this class was too low
to maintain it. See the following comments.
• loadPage(n) / doc[n] now accept arbitrary integers to specify a page number, as long as n < pageCount. So, e.g.
doc[-500] is always valid and will load page (-500) % pageCount.
• A document can now also be used as an iterator like this: for page in doc: . . . <do something with “page”> . . . .
This will yield all pages of doc as page.
• The Pixmap method getSize() has been replaced with property size. As before Pixmap.size == len(Pixmap) is
true.
• In response to transparency (alpha) being optional, several new parameters and properties have been added to
Pixmap and Colorspace classes to support determining their characteristics.
• The Page class now contains new properties firstAnnot and firstLink to provide starting points to the respective
class chains, where firstLink is just a mnemonic synonym to method loadLinks() which continues to exist.
Similarly, the new property rect is a synonym for method bound(), which also continues to exist.
• Pixmap methods samplesRGB() and samplesAlpha() have been deleted because pixmaps can now be created
without transparency.
• Rect now has a property irect which is a synonym of method round(). Likewise, IRect now has property rect to
deliver a Rect which has the same coordinates as float values.
• Document has the new method searchPageFor() to search for a text string. It works exactly like the
corresponding Page.searchFor() with page number as additional parameter.
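The page number rule for loadPage(n) / doc[n] in this list (any integer n < pageCount is accepted and mapped via the modulo operation) can be checked with a small calculation. The helper `resolve_page_number` is hypothetical and only mirrors the rule as stated; the page count of 12 is an arbitrary example value.

```python
def resolve_page_number(n, page_count):
    """Map an integer n < page_count to a valid 0-based page number.

    Hypothetical helper mirroring the rule described in the list above.
    """
    if n >= page_count:
        raise ValueError("page number must be less than the page count")
    return n % page_count


print(resolve_page_number(-1, 12))    # 11, the last page
print(resolve_page_number(-500, 12))  # 4, because -500 == -42 * 12 + 4
```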
• For convenience, documents now support simple indexing: doc.loadPage(n) == doc[n]. The index may
however be in range -pageCount < n < pageCount, such that doc[-1] is the last page of the document.
• The new document method select(list) removes all pages from a document that are not contained in the list.
Pages can also be duplicated and re-arranged.
• Various improvements and new members in our demo and examples collections. Perhaps most prominently:
PDF_display now supports scrolling with the mouse wheel, and there is a new example program wxTableExtract
which allows graphically identifying and extracting table data in documents.
• fitz.open() is now an alias of fitz.Document().
• New pixmap method tobytes() which will return a bytearray formatted as a PNG image of the pixmap.
• New pixmap method samplesRGB() providing a samples version with alpha bytes stripped off (RGB colorspaces
only).
• New pixmap method samplesAlpha() providing the alpha bytes only of the samples area.
• New iterator fitz.Pages(doc) over a document’s set of pages.
• New matrix methods invert() (calculate inverted matrix), concat() (calculate matrix product), pretranslate()
(perform a shift operation).
• New IRect methods intersect() (intersection with another rectangle), translate() (perform a shift operation).
• New Rect methods intersect() (intersection with another rectangle), transform() (transformation with a matrix),
include_point() (enlarge rectangle to also contain a point), include_rect() (enlarge rectangle to also contain
another one).
• Documented Point.transform() (transform a point with a matrix).
• Matrix, IRect, Rect and Point classes now support compact, algebraic formulations for manipulating such
objects.
• Incremental saves for changes are possible now using the call pattern doc.save(doc.name, incremental=True).
• A PDF’s metadata can now be deleted, set or changed by document method set_metadata(). Supports
incremental saves.
• A PDF’s bookmarks (or table of contents) can now be deleted, set or changed with the entries of a list using
document method set_toc(list). Supports incremental saves.
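The effect of select(list) described in this list (keep only the listed pages, with duplication and re-arrangement allowed) can be modeled on an ordinary Python list. This is a sketch of the semantics only: the function `select_pages` is hypothetical and returns a new list, whereas the document method works on the document itself.

```python
def select_pages(pages, numbers):
    """Return the pages named in `numbers`, in that order.

    Hypothetical model of the selection semantics, not the PyMuPDF method.
    """
    return [pages[n] for n in numbers]


pages = ["p0", "p1", "p2", "p3"]
print(select_pages(pages, [3, 0, 0]))  # ['p3', 'p0', 'p0']
```

Pages 1 and 2 are dropped because they are not in the list, page 0 appears twice, and page 3 moves to the front.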
Deprecated Names
The original naming convention for methods and properties has been “camelCase”. Since its creation around 2013,
a tremendous increase of functionality has happened in PyMuPDF – and with it a corresponding increase in classes,
methods and properties. In too many cases, this has led to non-intuitive, illogical and ugly names, difficult to memorize
or guess.
A few versions ago, I therefore decided to shift gears and switch to a “snake_cased” naming standard. This was a
major effort, which needed a step-wise approach. I think I am done with it now (version 1.18.14).
The following list maps deprecated names to their new versions. For example, property pageCount became
page_count in the Document class. There also are less obvious name changes, e.g. method getPNGdata was
renamed to tobytes in the Pixmap class.
Names of classes (camel case) and package-wide constants (the majority is upper case) remain untouched.
Old names will remain available as deprecated aliases through MuPDF version 1.19.0 and be removed in the version
that follows it - probably version 1.20.0, but this depends on upstream decisions (MuPDF).
Starting with version 1.19.0, we will issue deprecation warnings on sys.stderr like Deprecation:
'newPage' removed from class 'Document' after v1.19.0 - use 'new_page'. when
aliased methods are being used. Using a deprecated property will not cause this type of warning.
Starting immediately, all deprecated objects (methods and properties) will show a copy of the original’s docstring,
prefixed with the deprecation message, for example:
>>> print(fitz.Document.pageCount.__doc__)
*** Deprecated and removed in version following 1.19.0 - use 'page_count'. ***
Number of pages.
>>> print(fitz.Document.newPage.__doc__)
*** Deprecated and removed in version following 1.19.0 - use 'new_page'. ***
Create and return a new page object.
Args:
pno: (int) insert before this page. Default: after last page.
width: (float) page width in points. Default: 595 (ISO A4 width).
height: (float) page height in points. Default 842 (ISO A4 height).
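How an alias can both warn on sys.stderr and carry a prefixed copy of the original docstring is sketched below in plain Python. This is an illustration under stated assumptions, not PyMuPDF's actual mechanism: the helper `deprecated_alias` and the class `SketchDocument` are hypothetical, and only the message texts are taken from the examples above.

```python
import sys


def deprecated_alias(new_func, old_name, cls_name):
    """Wrap new_func under a deprecated name (hypothetical helper)."""
    new_name = new_func.__name__
    msg = (f"Deprecation: '{old_name}' removed from class '{cls_name}' "
           f"after v1.19.0 - use '{new_name}'.")

    def wrapper(*args, **kwargs):
        print(msg, file=sys.stderr)  # warn whenever the alias is called
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    # Copy of the original docstring, prefixed with the deprecation message.
    wrapper.__doc__ = (f"*** Deprecated and removed in version following 1.19.0"
                       f" - use '{new_name}'. ***\n{new_func.__doc__}")
    return wrapper


class SketchDocument:
    def new_page(self):
        """Create and return a new page object."""
        return "page"

    newPage = deprecated_alias(new_page, "newPage", "Document")


print(SketchDocument.newPage.__doc__)
```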
There is a utility script alias-changer.py which can be used to do mass-renames in your scripts. It accepts either a
single file or a folder as argument. If a folder is supplied, all its Python files and those of its subfolders are changed.
Optionally, backups of the scripts can be taken.
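The kind of mass-rename such a script performs can be sketched with a regular expression. The code below is not alias-changer.py itself, whose internals are not shown here; the function `rename_deprecated` is hypothetical, and the mapping contains only a few entries copied from the list of deprecated names in this chapter.

```python
import re

# A few entries from the list of deprecated names in this chapter.
RENAMES = {
    "pageCount": "page_count",
    "newPage": "new_page",
    "loadPage": "load_page",
    "getText": "get_text",
}

# Match a deprecated name only when used as an attribute, e.g. "doc.pageCount".
# The trailing \b keeps longer names such as "getTextBlocks" untouched.
PATTERN = re.compile(r"\.(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def rename_deprecated(source):
    """Return source with the known deprecated attribute names replaced."""
    return PATTERN.sub(lambda m: "." + RENAMES[m.group(1)], source)


old = "page = doc.loadPage(doc.pageCount - 1)\ntext = page.getText()\n"
print(rename_deprecated(old))
```

A plain textual replacement like this cannot tell whether an attribute really belongs to a PyMuPDF object, so reviewing the changes (or keeping backups) remains advisable.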
Deprecated names are not separately documented. The following list will help you find the documentation of the
original.
Note: This is automatically generated. One or two items refer to yet undocumented methods - please simply ignore
them.
• _isWrapped – Page.is_wrapped
• addCaretAnnot – Page.add_caret_annot()
• addCircleAnnot – Page.add_circle_annot()
• addFileAnnot – Page.add_file_annot()
• addFreetextAnnot – Page.add_freetext_annot()
• addHighlightAnnot – Page.add_highlight_annot()
• addInkAnnot – Page.add_ink_annot()
• addLineAnnot – Page.add_line_annot()
• addPolygonAnnot – Page.add_polygon_annot()
• addPolylineAnnot – Page.add_polyline_annot()
• addRectAnnot – Page.add_rect_annot()
• addRedactAnnot – Page.add_redact_annot()
• addSquigglyAnnot – Page.add_squiggly_annot()
• addStampAnnot – Page.add_stamp_annot()
• addStrikeoutAnnot – Page.add_strikeout_annot()
• addTextAnnot – Page.add_text_annot()
• addUnderlineAnnot – Page.add_underline_annot()
• addWidget – Page.add_widget()
• chapterCount – Document.chapter_count
• chapterPageCount – Document.chapter_page_count()
• cleanContents – Page.clean_contents()
• clearWith – Pixmap.clear_with()
• convertToPDF – Document.convert_to_pdf()
• copyPage – Document.copy_page()
• copyPixmap – Pixmap.copy()
• CropBox – Page.cropbox
• CropBoxPosition – Page.cropbox_position
• deleteAnnot – Page.delete_annot()
• deleteLink – Page.delete_link()
• deletePage – Document.delete_page()
• deletePageRange – Document.delete_pages()
• deleteWidget – Page.delete_widget()
• derotationMatrix – Page.derotation_matrix
• drawBezier – Page.draw_bezier()
• drawBezier – Shape.draw_bezier()
• drawCircle – Page.draw_circle()
• drawCircle – Shape.draw_circle()
• drawCurve – Page.draw_curve()
• drawCurve – Shape.draw_curve()
• drawLine – Page.draw_line()
• drawLine – Shape.draw_line()
• drawOval – Page.draw_oval()
• drawOval – Shape.draw_oval()
• drawPolyline – Page.draw_polyline()
• drawPolyline – Shape.draw_polyline()
• drawQuad – Page.draw_quad()
• drawQuad – Shape.draw_quad()
• drawRect – Page.draw_rect()
• drawRect – Shape.draw_rect()
• drawSector – Page.draw_sector()
• drawSector – Shape.draw_sector()
• drawSquiggle – Page.draw_squiggle()
• drawSquiggle – Shape.draw_squiggle()
• drawZigzag – Page.draw_zigzag()
• drawZigzag – Shape.draw_zigzag()
• embeddedFileAdd – Document.embfile_add()
• embeddedFileCount – Document.embfile_count()
• embeddedFileDel – Document.embfile_del()
• embeddedFileGet – Document.embfile_get()
• embeddedFileInfo – Document.embfile_info()
• embeddedFileNames – Document.embfile_names()
• embeddedFileUpd – Document.embfile_upd()
• extractFont – Document.extract_font()
• extractImage – Document.extract_image()
• fileGet – Annot.get_file()
• fileUpd – Annot.update_file()
• fillTextbox – TextWriter.fill_textbox()
• findBookmark – Document.find_bookmark()
• firstAnnot – Page.first_annot
• firstLink – Page.first_link
• firstWidget – Page.first_widget
• fullcopyPage – Document.fullcopy_page()
• gammaWith – Pixmap.gamma_with()
• getArea – Rect.get_area()
• getArea – IRect.get_area()
• getCharWidths – Document.get_char_widths()
• getContents – Page.get_contents()
• getDisplayList – Page.get_displaylist()
• getDrawings – Page.get_drawings()
• getFontList – Page.get_fonts()
• getImageBbox – Page.get_image_bbox()
• getImageData – Pixmap.tobytes()
• getImageList – Page.get_images()
• getLinks – Page.get_links()
• getOCGs – Document.get_ocgs()
• getPageFontList – Document.get_page_fonts()
• getPageImageList – Document.get_page_images()
• getPagePixmap – Document.get_page_pixmap()
• getPageText – Document.get_page_text()
• getPageXObjectList – Document.get_page_xobjects()
• getPDFnow – get_pdf_now()
• getPDFstr – get_pdf_str()
• getPixmap – Page.get_pixmap()
• getPixmap – Annot.get_pixmap()
• getPixmap – DisplayList.get_pixmap()
• getPNGData – Pixmap.tobytes()
• getPNGdata – Pixmap.tobytes()
• getRectArea – Rect.get_area()
• getRectArea – IRect.get_area()
• getSigFlags – Document.get_sigflags()
• getSVGimage – Page.get_svg_image()
• getText – Page.get_text()
• getText – Annot.get_text()
• getTextBlocks – Page.get_text_blocks()
• getTextbox – Page.get_textbox()
• getTextbox – Annot.get_textbox()
• getTextLength – get_text_length()
• getTextPage – Page.get_textpage()
• getTextPage – Annot.get_textpage()
• getTextPage – DisplayList.get_textpage()
• getTextWords – Page.get_text_words()
• getToC – Document.get_toc()
• getXmlMetadata – Document.get_xml_metadata()
• ImageProperties – image_properties()
• includePoint – Rect.include_point()
• includePoint – IRect.include_point()
• includeRect – Rect.include_rect()
• includeRect – IRect.include_rect()
• insertFont – Page.insert_font()
• insertImage – Page.insert_image()
• insertLink – Page.insert_link()
• insertPage – Document.insert_page()
• insertPDF – Document.insert_pdf()
• insertText – Page.insert_text()
• insertText – Shape.insert_text()
• insertTextbox – Page.insert_textbox()
• insertTextbox – Shape.insert_textbox()
• invertIRect – Pixmap.invert_irect()
• isConvex – Quad.is_convex
• isDirty – Document.is_dirty
• isEmpty – Rect.is_empty
• isEmpty – IRect.is_empty
• isEmpty – Quad.is_empty
• isFormPDF – Document.is_form_pdf
• isInfinite – Rect.is_infinite
• isInfinite – IRect.is_infinite
• isPDF – Document.is_pdf
• isRectangular – Quad.is_rectangular
• isRectilinear – Matrix.is_rectilinear
• isReflowable – Document.is_reflowable
• isRepaired – Document.is_repaired
• isStream – Document.is_stream()
• lastLocation – Document.last_location
• lineEnds – Annot.line_ends
• loadAnnot – Page.load_annot()
• loadLinks – Page.load_links()
• loadPage – Document.load_page()
• makeBookmark – Document.make_bookmark()
• MediaBox – Page.mediabox
• MediaBoxSize – Page.mediabox_size
• metadataXML – Document.xref_xml_metadata()
• movePage – Document.move_page()
• needsPass – Document.needs_pass
• newPage – Document.new_page()
• newShape – Page.new_shape()
• nextLocation – Document.next_location()
• pageCount – Document.page_count
• pageCropBox – Document.page_cropbox()
• pageXref – Document.page_xref()
• PaperRect – paper_rect()
• PaperSize – paper_size()
• paperSizes – paper_sizes
• PDFCatalog – Document.pdf_catalog()
• PDFTrailer – Document.pdf_trailer()
• pillowData – Pixmap.pil_tobytes()
• pillowWrite – Pixmap.pil_save()
• planishLine – planish_line()
• preRotate – Matrix.prerotate()
• preScale – Matrix.prescale()
• preShear – Matrix.preshear()
• preTranslate – Matrix.pretranslate()
• previousLocation – Document.prev_location()
• readContents – Page.read_contents()
• resolveLink – Document.resolve_link()
• rotationMatrix – Page.rotation_matrix
• searchFor – Page.search_for()
• searchPageFor – Document.search_page_for()
• setAlpha – Pixmap.set_alpha()
• setBlendMode – Annot.set_blendmode()
• setBorder – Annot.set_border()
• setColors – Annot.set_colors()
• setCropBox – Page.set_cropbox()
• setFlags – Annot.set_flags()
• setInfo – Annot.set_info()
• setLanguage – Document.set_language()
• setLineEnds – Annot.set_line_ends()
• setMediaBox – Page.set_mediabox()
• setMetadata – Document.set_metadata()
• setName – Annot.set_name()
• setOC – Annot.set_oc()
• setOpacity – Annot.set_opacity()
• setOrigin – Pixmap.set_origin()
• setPixel – Pixmap.set_pixel()
• setRect – Annot.set_rect()
• setRect – Pixmap.set_rect()
• setResolution – Pixmap.set_dpi()
• setRotation – Page.set_rotation()
• setToC – Document.set_toc()
• setXmlMetadata – Document.set_xml_metadata()
• showPDFpage – Page.show_pdf_page()
• soundGet – Annot.get_sound()
• tintWith – Pixmap.tint_with()
• transformationMatrix – Page.transformation_matrix
• updateLink – Page.update_link()
• updateObject – Document.update_object()
• updateStream – Document.update_stream()
• wrapContents – Page.wrap_contents()
• writeImage – Pixmap.save()
• writePNG – Pixmap.save()
• writeText – Page.write_text()
• writeText – TextWriter.write_text()
• xrefLength – Document.xref_length()
• xrefObject – Document.xref_object()
• xrefStream – Document.xref_stream()
• xrefStreamRaw – Document.xref_stream_raw()
O
object (built-in variable), 299
oc
    draw_bezier, 185
    draw_circle, 185
    draw_curve, 185
    draw_line, 185
    draw_oval, 185
    draw_polyline, 185
    draw_quad, 186
    draw_rect, 185
    draw_sector, 185
    draw_squiggle, 185
    draw_zigzag, 185
    finish, 236
    insert_image, 187
    insert_text, 184, 236
    insert_textbox, 184, 238
OCCD (built-in variable), 300
OCG (built-in variable), 300
OCMD (built-in variable), 300
OCPD (built-in variable), 300
opacity (Annot attribute), 101
opacity (TextWriter attribute), 259
open
    Document, 110
    filename, 110
    filetype, 110
    fontsize, 110
    height, 110
    rect, 110
    stream, 110
    width, 110
Outline (built-in class), 172
outline (Document attribute), 145
outline_xref() (Document method), 129
overlay
    commit, 240
    draw_bezier, 185
    draw_circle, 185
    draw_curve, 185
    draw_line, 185
    draw_oval, 185
    draw_polyline, 185
    draw_quad, 186
P
Page (built-in class), 175
page (built-in variable), 299
page (linkDest attribute), 163
page (Outline attribute), 172
page (Shape attribute), 240
page_count (Document attribute), 147
page_cropbox() (Document method), 119
page_xref() (Document method), 119
pageCount, 370
pageCropBox, 370
pages
    delete, 57
    rearrange, 57
pages() (Document method), 120
pagetree (built-in variable), 299
pageXref, 370
paper_rect(), 279
paper_size(), 278
paper_sizes(), 282
PaperRect, 370
PaperSize, 370
paperSizes, 370
parent (Annot attribute), 101
parent (Page attribute), 204
Partial Pixmaps, 14
PDF
    extract image, 16
    picture embed, 18
pdf_catalog() (Document method), 140
pdf_trailer() (Document method), 140
PDFCatalog, 370
pdfocr_save() (Pixmap method), 213
pdfocr_tobytes() (Pixmap method), 213
PDFTrailer, 370
permissions (Document attribute), 146
PhotoImage
    examples, 21
Photoshop
    examples, 21
picture
    embed PDF, 18
pil_save() (Pixmap method), 214
pil_tobytes() (Pixmap method), 214
pillowData, 370
xrefStream, 372
xrefStreamRaw, 372
xres (Pixmap attribute), 218
Y
y (Pixmap attribute), 218
y (Point attribute), 221
y0 (IRect attribute), 159
y0 (Rect attribute), 230
y1 (IRect attribute), 159
y1 (Rect attribute), 230
yres (Pixmap attribute), 218
Z
zoom, 14
resolution, 14