How I Made PDF Search Actually Fast

Published
4 min read

Let me tell you about a problem that nearly drove me crazy. I was building a PDF viewer, and the search feature was painfully slow. Like, "go make coffee while it searches" slow.

The Usual (Expensive) Way

Here's what most PDF viewers do when you search:

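// The expensive approach (pseudocode): render first, search second
function searchPDF(query) {
  for (let page = 1; page <= totalPages; page++) {
    renderTextLayer(page);           // lay out every character on the page
    const text = extractText(page);  // read the text back out of the rendered layer
    if (text.includes(query)) {
      highlightText(page);           // mark the matches
    }
  }
}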

See the problem? For every single page, they're:

  1. Rendering the entire text layer (positions every character on screen)

  2. Then extracting text from that rendered layer

  3. Then searching through it

For a 1000+ page document, this means rendering 1000+ text layers. Each one takes time because it has to calculate the position of every single character. It's like printing an entire book just to find one word.

The Problem

This approach is slow because rendering text layers is expensive. It blocks the UI, freezes the browser, and makes users wonder if something broke.

I needed something better, and I had to build it myself because every existing solution I found was paid.

My Solution: The Smart Hack

Here's the thing I realized: you don't need to render text on screen to search through it.

PDF.js (the library I was using) lets you extract raw text data directly from the PDF without rendering anything.
My hack was simple: extract text from all pages silently in the background, search through that data, and only render text layers for the page you're actually viewing.

Think of it like this:

  • Normal way: Print every page, then search through the printed pages

  • My way: Keep a digital copy, search that instantly, only print the page you're reading

The Implementation

1. Extract Text Content Upfront (Not the Visual Layer!)

When the PDF loads, I extract just the raw text content from all pages in the background. This is NOT the same as rendering the text layer.

  • Text content: Just the raw text strings from the PDF ("Hello world", "Your mom", etc.) - super lightweight

  • Text layer: The visual overlay with positioned characters that you see on screen - expensive to render

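// Run this when the PDF loads.
// Assumes `pdf` is the loaded PDF.js document and `pageTexts` is an object declared elsewhere.
const BATCH_SIZE = 5;

const extractAllPagesText = async (totalPages) => {
  for (let start = 1; start <= totalPages; start += BATCH_SIZE) {
    // Clamp the batch so the last one never asks for a page past the end
    const end = Math.min(start + BATCH_SIZE - 1, totalPages);
    const batch = [];
    for (let pageNum = start; pageNum <= end; pageNum++) batch.push(pageNum);

    // Load the pages of this batch in parallel, then move on to the next batch
    await Promise.all(
      batch.map(async (pageNum) => {
        const page = await pdf.getPage(pageNum);
        const textContent = await page.getTextContent(); // raw text items, no rendering
        pageTexts[pageNum] = textContent;
      })
    );
  }
};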

This takes a couple of seconds, but it happens in the background, so users can start reading immediately. I'm just storing text content, not rendering anything.
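The extraction step above keeps the whole text content object for each page. If memory matters, one option is to flatten each page to a plain string as soon as it is extracted. Here is a minimal sketch, assuming the PDF.js shape where `textContent.items` is an array of objects with a `str` field. `flattenTextContent` is a hypothetical helper name, not part of PDF.js.

```javascript
// Hypothetical helper: collapse one page's text content into a plain string.
// Assumes each text item exposes its text as `str`; items without a string
// `str` (for example marked-content markers) are skipped.
function flattenTextContent(textContent) {
  return textContent.items
    .filter((item) => typeof item.str === 'string')
    .map((item) => item.str)
    .join(' ')
    .replace(/\s+/g, ' ') // collapse runs of whitespace left by the join
    .trim();
}

// Example with a hand-written object shaped like PDF.js output
const sampleTextContent = {
  items: [
    { str: 'Quarterly' },
    { str: 'revenue ' },
    { type: 'beginMarkedContent' },
    { str: 'report' },
  ],
};
const flatPageText = flattenTextContent(sampleTextContent);
```

Storing the flattened string instead of the full object would mean the search doesn't have to rebuild it on every query, though the search function would then need to treat each entry as a string. One caveat: joining with spaces can split a word that the PDF stores as separate fragments.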

2. Search Through Raw Text Content

Now when someone types a search query, I just scan through the raw text content I already extracted:

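const searchInDocument = (query) => {
  const results = [];
  if (!query) return results; // an empty query would otherwise match every page

  for (let page = 1; page <= totalPages; page++) {
    const textContent = pageTexts[page]; // Raw text content from memory
    if (!textContent) continue; // background extraction hasn't reached this page yet

    // Join the page's text items into one string
    const pageText = textContent.items
      .map(item => item.str)
      .join(' ');

    if (pageText.includes(query)) { // case-sensitive match
      results.push(page);
    }
  }

  return results; // page numbers that contain the query
};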

This is very fast because I'm just searching through JavaScript strings in memory. No rendering involved at all.
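If you want the search to ignore case and report how many hits each page has, the same idea extends naturally. The sketch below is a hypothetical variant, not the code I shipped: it assumes `pageStrings` maps page numbers to plain strings, and `searchPages` is an illustrative name.

```javascript
// Hypothetical variant of the search: case-insensitive, with a match count per page.
// `pageStrings` maps page number -> plain text for that page (an assumption).
function searchPages(pageStrings, query) {
  const needle = query.trim().toLowerCase();
  const results = [];
  if (needle === '') return results; // nothing to search for

  for (const [pageKey, text] of Object.entries(pageStrings)) {
    const haystack = text.toLowerCase();
    let count = 0;
    let from = haystack.indexOf(needle);
    while (from !== -1) {
      count++;
      // Continue after this hit so matches never overlap
      from = haystack.indexOf(needle, from + needle.length);
    }
    if (count > 0) results.push({ page: Number(pageKey), count });
  }

  // Sort explicitly so callers can rely on ascending page order
  return results.sort((a, b) => a.page - b.page);
}

const demoResults = searchPages(
  { 1: 'Revenue grew. Net revenue doubled.', 2: 'Costs fell.', 3: 'REVENUE outlook' },
  'revenue'
);
```

Lowercasing both sides is the simplest way to get case-insensitive matching. It is still plain string scanning in memory, so it stays cheap.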

3. Only Render Text Layers for What's Visible

Here's the clever bit. I only render the visual text layer (the thing that shows yellow highlights) for three pages: the current page and the ones directly before and after it.


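// Current page plus its neighbours, dropping anything outside 1..totalPages
const pagesToRender = [
  currentPage - 1,
  currentPage,
  currentPage + 1
].filter((page) => page >= 1 && page <= totalPages);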

So if you're on page 42 of a 100-page document, I'm only rendering visual text layers for pages 41, 42, and 43. The other 97 pages? Just the PDF image, no text layer.

When you jump to a search result on page 75, I update to render text layers for pages 74, 75, and 76 instead.
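The window logic, including the jump from one window to another, can be sketched as two small functions. This is an illustration under my own assumptions: `pagesInWindow`, `diffWindows`, and the `radius` parameter are hypothetical names, and how you actually build or remove a text layer depends on your viewer.

```javascript
// Pages whose text layer should exist for a given current page,
// clamped to the document bounds. radius = 1 means previous/current/next.
function pagesInWindow(currentPage, totalPages, radius = 1) {
  const pages = [];
  const first = Math.max(1, currentPage - radius);
  const last = Math.min(totalPages, currentPage + radius);
  for (let page = first; page <= last; page++) pages.push(page);
  return pages;
}

// Work out which text layers to build and which to tear down after a jump.
function diffWindows(previousPages, nextPages) {
  const previous = new Set(previousPages);
  const next = new Set(nextPages);
  return {
    toRender: nextPages.filter((page) => !previous.has(page)),
    toRemove: previousPages.filter((page) => !next.has(page)),
  };
}

const windowBefore = pagesInWindow(42, 100); // [41, 42, 43]
const windowAfter = pagesInWindow(75, 100);  // [74, 75, 76]
const jumpPlan = diffWindows(windowBefore, windowAfter);
```

Diffing the windows means that when you only move one page forward, two of the three text layers are reused and only one new layer gets rendered.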

How It Feels in Practice

Let's say you search for "revenue" in a 100-page report:

  1. You type "revenue" and hit enter

  2. Search runs through all 100 pages in memory → takes maybe 50ms

  3. Results show up: "Found on pages 5, 12, and 47"

  4. You're on page 5, so pages 4-6 have text layers rendered with highlights

  5. You click "next result" to jump to page 12

  6. Text layers update to pages 11-13, highlights show up instantly

  7. Smooth sailing
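The "next result" step in that flow is just a lookup in the list of result pages. A minimal sketch, assuming the search returns page numbers in ascending order. `nextResultPage` is a hypothetical helper, and wrapping back to the first result is my own design assumption.

```javascript
// Hypothetical helper: pick the page to jump to for "next result".
// `resultPages` is the ascending list of pages returned by the search.
function nextResultPage(resultPages, currentPage) {
  if (resultPages.length === 0) return null; // nothing to jump to
  const next = resultPages.find((page) => page > currentPage);
  // After the last result, wrap around to the first one
  return next !== undefined ? next : resultPages[0];
}

const resultPages = [5, 12, 47];
const fromPageFive = nextResultPage(resultPages, 5);        // 12
const fromPageFortySeven = nextResultPage(resultPages, 47); // wraps to 5
```

Once you have the target page, the viewer only needs to scroll there and refresh the three-page text layer window around it.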

The Difference

Usual implementation:

  • Search triggered → Render 100 text layers → Extract text from layers → Search → Show results

  • Time: 30-60 seconds for 100 pages

My implementation:

  • PDF loads → Extract raw text content (not layers!) in 2-3 seconds

  • Search triggered → Search raw text in memory (50ms) → Render only 3 text layers for highlights → Show results

  • Time: Instant

The key: I search through lightweight text content, not expensive rendered layers.

Note

Add a debounce so the search doesn't run on every keystroke.

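// Wait for the user to stop typing (React; needs `import { useEffect } from 'react'`)
useEffect(() => {
  const timer = setTimeout(() => {
    searchInDocument(searchQuery); // runs 300ms after the last keystroke
  }, 300);

  // Cancel the pending search if the query changes again first
  return () => clearTimeout(timer);
}, [searchQuery]);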
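If you're not using React, the same behaviour fits in a small standalone helper. This is a generic debounce sketch rather than anything specific to my viewer. `debounce`, `runSearch`, and `seenQueries` are illustrative names, with `runSearch` standing in for `searchInDocument`.

```javascript
// Minimal debounce: returns a wrapper that runs `fn` only after `delayMs`
// has passed without another call. Not tied to any framework.
function debounce(fn, delayMs) {
  let timer = null;
  const debounced = (...args) => {
    if (timer !== null) clearTimeout(timer); // drop the previous pending call
    timer = setTimeout(() => {
      timer = null;
      fn(...args);
    }, delayMs);
  };
  // Let callers drop a pending call, e.g. when the viewer is closed
  debounced.cancel = () => {
    if (timer !== null) clearTimeout(timer);
    timer = null;
  };
  return debounced;
}

// Usage: record which queries actually reach the search function
const seenQueries = [];
const runSearch = (query) => seenQueries.push(query);
const debouncedSearch = debounce(runSearch, 20);
debouncedSearch('r');
debouncedSearch('rev');
debouncedSearch('revenue'); // only this call should reach runSearch
```

The delay is a trade-off: too short and you search on half-typed words, too long and the results feel laggy.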

The Takeaway

The whole trick is understanding that you can separate searching the data from showing the highlights.

Extract once, store smart, render only what's visible. That's it.

Now my PDF search feels instant, even on huge documents. And I can finally drink my coffee in peace.