How I Made PDF Search Actually Fast
Let me tell you about a problem that nearly drove me crazy. I was building a PDF viewer, and the search feature was painfully slow. Like, "go make coffee while it searches" slow.
The Usual (Expensive) Way
Here's what most PDF viewers do when you search:
// The expensive approach everyone uses
function searchPDF(query) {
  for (let page = 1; page <= totalPages; page++) {
    renderTextLayer(page);
    const text = extractText(page);
    if (text.includes(query)) {
      highlightText(page);
    }
  }
}
See the problem? For every single page, they're:
Rendering the entire text layer (positions every character on screen)
Then extracting text from that rendered layer
Then searching through it
For a 1,000+ page document, this means rendering more than 1,000 text layers. Each one takes time because the viewer has to calculate the position of every single character. It's like printing an entire book just to find one word.
The Problem
This approach is slow because rendering text layers is expensive. It blocks the UI, freezes the browser, and makes users wonder if something broke.
I needed something better, and every ready-made solution I found was a paid product.
My Solution: The Smart Hack
Here's the thing I realized: you don't need to render text on screen to search through it.
PDF.js (the library I was using) lets you extract raw text data directly from the PDF without rendering anything.
My hack was simple: extract text from all pages silently in the background, search through that data, and only render text layers for the page you're actually viewing.
Think of it like this:
Normal way: Print every page, then search through the printed pages
My way: Keep a digital copy, search that instantly, only print the page you're reading
The Implementation
1. Extract Text Content Upfront (Not the Visual Layer!)
When the PDF loads, I extract just the raw text content from all pages in the background. This is NOT the same as rendering the text layer.
Text content: Just the raw text strings from the PDF ("Hello world", "Your mom", etc.) - super lightweight
Text layer: The visual overlay with positioned characters that you see on screen - expensive to render
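// Run this when the PDF loads.
// Assumes `pdf` (the loaded PDF.js document) and `pageTexts`
// (an object keyed by page number) already exist in scope.
const extractAllPagesText = async (totalPages) => {
  for (let i = 1; i <= totalPages; i += 5) {
    // Batches of 5 pages; the last batch is clipped so we never
    // ask PDF.js for a page past the end of the document
    const batch = [i, i + 1, i + 2, i + 3, i + 4].filter((n) => n <= totalPages);
    await Promise.all(
      batch.map(async (pageNum) => {
        const page = await pdf.getPage(pageNum);
        const textContent = await page.getTextContent();
        pageTexts[pageNum] = textContent;
      })
    );
  }
};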
This takes a couple of seconds, but it happens in the background, so users can start reading immediately. I'm just storing text strings, not rendering anything.
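For a sense of what is actually being stored, here is a rough sketch. The `sampleTextContent` object is hand-written to mimic the shape PDF.js resolves from `page.getTextContent()`: an `items` array whose entries carry a `str` field plus positioning data such as `transform`. The values and the helper name `textContentToString` are made up for illustration.

```javascript
// Hand-written stand-in for what page.getTextContent() resolves to in PDF.js.
// Real items carry more fields; only `str` matters for searching.
const sampleTextContent = {
  items: [
    { str: "Quarterly", transform: [12, 0, 0, 12, 72, 700], width: 54, height: 12 },
    { str: "revenue", transform: [12, 0, 0, 12, 130, 700], width: 44, height: 12 },
    { str: "report", transform: [12, 0, 0, 12, 178, 700], width: 34, height: 12 },
  ],
};

// Hypothetical helper: flatten one page's text content into a plain string.
function textContentToString(textContent) {
  return textContent.items.map((item) => item.str).join(" ");
}

console.log(textContentToString(sampleTextContent));
```

If memory becomes a concern on very large documents, one option is to keep only this flattened string per page and ask PDF.js for the full text content again when a page's text layer is actually rendered.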
2. Search Through Raw Text Content
Now when someone types a search query, I just scan through the raw text content I already extracted:
const searchInDocument = (query) => {
  const results = [];
  if (!query) return results; // An empty query would match every page
  for (let page = 1; page <= totalPages; page++) {
    const textContent = pageTexts[page]; // Raw text content from memory
    if (!textContent) continue; // This page hasn't been extracted yet
    const pageText = textContent.items
      .map((item) => item.str)
      .join(' ');
    if (pageText.includes(query)) {
      results.push(page);
    }
  }
  return results; // Page numbers that contain the query
};
This is very fast because I'm just searching through JavaScript strings in memory. No rendering involved at all.
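The check above is case-sensitive and only reports whether a page matches. A small variation, sketched here on the assumption that each page has already been flattened to a plain string, lowercases both sides and counts the matches per page. `pageStrings` and `countMatches` are hypothetical names, not part of PDF.js.

```javascript
// Hypothetical input: page number -> flattened page text.
const pageStrings = {
  1: "Revenue grew this quarter. Revenue targets were met.",
  2: "Nothing relevant on this page.",
  3: "Projected revenue for next year.",
};

// Count case-insensitive, non-overlapping occurrences of `query` on every page.
function countMatches(pages, query) {
  const needle = query.trim().toLowerCase();
  const results = [];
  if (needle === "") return results; // an empty query would otherwise match everywhere

  for (const [pageNum, text] of Object.entries(pages)) {
    const haystack = text.toLowerCase();
    let count = 0;
    let index = haystack.indexOf(needle);
    while (index !== -1) {
      count += 1;
      index = haystack.indexOf(needle, index + needle.length);
    }
    if (count > 0) results.push({ page: Number(pageNum), count });
  }
  return results;
}

console.log(countMatches(pageStrings, "revenue"));
```

Having a count per page makes it possible to show something like "3 matches on page 12" in the results list without touching the text layer.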
3. Only Render Text Layers for What's Visible
Here's the clever bit. I only render the visual text layer (the thing that shows yellow highlights) for three pages: the current page and the ones directly before and after it.
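const pagesToRender = [
  currentPage - 1,
  currentPage,
  currentPage + 1
].filter((page) => page >= 1 && page <= totalPages); // Drop neighbors that don't exist (page 0, or one past the end)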
So if you're on page 42 of a 100-page document, I'm only rendering visual text layers for pages 41, 42, and 43. The other 97 pages? Just the PDF image, no text layer.
When you jump to a search result on page 75, I update to render text layers for pages 74, 75, and 76 instead.
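The jump from one window of pages to another can be sketched as a small diff. This is a minimal illustration, assuming old text layers are torn down when they fall outside the window; `textLayerWindow` and `diffWindows` are hypothetical helper names, and the actual rendering and cleanup calls are left out.

```javascript
// Pages whose text layer should exist: the current page plus its neighbors,
// clipped to the valid range 1..totalPages.
function textLayerWindow(currentPage, totalPages) {
  return [currentPage - 1, currentPage, currentPage + 1].filter(
    (page) => page >= 1 && page <= totalPages
  );
}

// Work out which text layers to tear down and which to render after a jump.
function diffWindows(previous, next) {
  return {
    remove: previous.filter((page) => !next.includes(page)),
    render: next.filter((page) => !previous.includes(page)),
  };
}

const windowBefore = textLayerWindow(42, 100); // pages around page 42
const windowAfter = textLayerWindow(75, 100); // pages around page 75
console.log(diffWindows(windowBefore, windowAfter));
```

When the user only moves one page forward, the two windows overlap, so the diff keeps the shared layers and renders just the one new page.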
How It Feels in Practice
Let's say you search for "revenue" in a 100-page report:
You type "revenue" and hit enter
Search runs through all 100 pages in memory → takes maybe 50ms
Results show up: "Found on pages 5, 12, and 47"
You're on page 5, so pages 4-6 have text layers rendered with highlights
You click "next result" to jump to page 12
Text layers update to pages 11-13, highlights show up instantly
Smooth sailing
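The "next result" step in that walkthrough boils down to picking the next page from the results array. Here is one way it might look; `nextResultPage` is a hypothetical helper, and wrapping back to the first result after the last one is my assumption rather than something the viewer has to do.

```javascript
// Hypothetical helper: given result pages in ascending order and the current
// page, return the page of the next result, wrapping to the first at the end.
function nextResultPage(resultPages, currentPage) {
  if (resultPages.length === 0) return null; // nothing to jump to
  const next = resultPages.find((page) => page > currentPage);
  return next !== undefined ? next : resultPages[0];
}

const resultPages = [5, 12, 47];
console.log(nextResultPage(resultPages, 5));
```

The returned page number is all the viewer needs: set it as the current page and the three-page text layer window follows along.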
The Difference
Usual implementation:
Search triggered → Render 100 text layers → Extract text from layers → Search → Show results
Time: 30-60 seconds for 100 pages
My implementation:
PDF loads → Extract raw text content (not layers!) in 2-3 seconds
Search triggered → Search raw text in memory (50ms) → Render only 3 text layers for highlights → Show results
Time: Instant
The key: I search through lightweight text content, not expensive rendered layers.
Note
Add a debounce so the search doesn't re-run on every keystroke:
// Wait for the user to stop typing
useEffect(() => {
  const timer = setTimeout(() => {
    searchInDocument(searchQuery);
  }, 300);
  return () => clearTimeout(timer);
}, [searchQuery]);
The Takeaway
The whole trick is understanding that searching the data and showing the highlights are two separate jobs.
Extract once, store smart, render only what's visible. That's it.
Now my PDF search feels instant, even on huge documents. And I can finally drink my coffee in peace.
