Python PDF Processing: Extract All Images in PDF File Using PyMuPDF

In this tutorial, we walk through the steps to extract all images from a PDF file using the Python library PyMuPDF.

1. Install the PyMuPDF library

pip install PyMuPDF

2. Import the libraries

import fitz #the PyMuPDF module
from PIL import Image
import io

3. Open a PDF file and save all images

filename = "my_file.pdf"
# open file
with fitz.open(filename) as my_pdf_file:

    #loop through every page
    for page_number in range(1, len(my_pdf_file)+1):

        # access the individual page (page indexes start at 0)
        page = my_pdf_file[page_number-1]

        # list all images on the page
        images = page.get_images()

        # report how many images were found
        if images:
            print(f"There are {len(images)} image(s) on page number {page_number} [+]")
        else:
            print(f"There are no images on page number {page_number} [!]")

        # loop through all images present on the page
        for image_number, image in enumerate(images, start=1):

            # access the image xref
            xref_value = image[0]

            # extract image information
            base_image = my_pdf_file.extract_image(xref_value)

            # access the image itself
            image_bytes = base_image["image"]

            #get image extension
            ext = base_image["ext"]

            #load image
            image = Image.open(io.BytesIO(image_bytes))

            # save image locally (Pillow picks the format from the file extension)
            image.save(f"Page{page_number}Image{image_number}.{ext}")

This example code uses three steps to save the images in a PDF file.

(1) Get the current PDF page

page = my_pdf_file[page_number-1]

(2) List all images on the current page

images = page.get_images()
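
Each entry in a page's image list begins with the image's xref, which is why the tutorial code reads it as image[0]. A PDF can reference the same image from several pages, so the same xref may appear more than once across the document. The sketch below shows one possible way to collect each xref only once. The helper name unique_image_xrefs and the sample tuples are hypothetical and only stand in for real image lists; the fields after the xref are made-up placeholders.

```python
def unique_image_xrefs(image_lists):
    """Return each image xref once, in first-seen order.

    image_lists: one list per page, shaped like a page's image list.
    Only item 0 (the xref) of each entry is used.
    """
    seen = set()
    ordered = []
    for page_images in image_lists:
        for entry in page_images:
            xref = entry[0]
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered


# Made-up entries standing in for two pages that share image 12.
page_1 = [(12, 0, 640, 480), (15, 0, 100, 100)]
page_2 = [(12, 0, 640, 480)]
print(unique_image_xrefs([page_1, page_2]))  # [12, 15]
```

Whether skipping repeats is desirable depends on the goal: the tutorial's per-page file names keep one copy per page, while a de-duplicated list avoids writing the same image several times.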

(3) Get the image data and save it

xref_value = image[0]

# extract image information
base_image = my_pdf_file.extract_image(xref_value)

# access the image itself
image_bytes = base_image["image"]

#get image extension
ext = base_image["ext"]

#load image
image = Image.open(io.BytesIO(image_bytes))

# save image locally (Pillow picks the format from the file extension)
image.save(f"Page{page_number}Image{image_number}.{ext}")
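
As a possible alternative to the Pillow round trip, the bytes in base_image["image"] can be written to disk as they are, since they are already encoded in the format named by base_image["ext"]. The sketch below assumes only those two dictionary keys from the tutorial. The helper name save_image_bytes, its out_dir parameter, and the placeholder dictionary are hypothetical and used for illustration.

```python
import tempfile
from pathlib import Path


def save_image_bytes(base_image, page_number, image_number, out_dir="."):
    """Write the raw image bytes to disk and return the output path.

    base_image: a dict with "image" (bytes) and "ext" (str) keys,
    as used in the tutorial code.
    """
    name = f"Page{page_number}Image{image_number}.{base_image['ext']}"
    out_path = Path(out_dir) / name
    out_path.write_bytes(base_image["image"])
    return out_path


# Placeholder dict standing in for a real extraction result.
fake_image = {"image": b"not a real image", "ext": "png"}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_image_bytes(fake_image, 1, 1, tmp_dir)
    print(saved.name)  # Page1Image1.png
```

Writing the bytes directly keeps the image exactly as stored in the PDF, while going through Pillow is useful when the image needs to be converted or edited before saving.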