How We Built a Bank Statement Parser Using Claude's Vision API
The Problem: Bank Statements Are Visual Documents, Not Structured Data
Bank statements look simple. Rows of dates, descriptions, and amounts arranged in neat columns. You'd think extracting this data programmatically would be straightforward. It isn't.
A bank statement PDF is a rendering instruction set. It tells a viewer where to draw text on a page, character by character. There are no tables, no rows, no columns in the underlying data. What appears as a clean transaction table is actually hundreds of independent text-placement commands scattered across the file. Two characters that appear adjacent on screen might be megabytes apart in the raw PDF.
We spent three months building a traditional OCR pipeline before scrapping it. Every bank formatted statements differently. Chase puts the date on the left with a two-line description. Bank of America uses a single-line format with the running balance on the right. Wells Fargo sometimes splits a single transaction across a page break. Each quirk required new parsing rules, and every rule introduced new edge cases.
When Anthropic released Claude's vision capabilities, we realized there was a fundamentally different approach: stop trying to reconstruct structure from raw PDF data, and instead look at the document the way a human does.
Why We Chose Claude Vision Over OCR + Regex
Our original pipeline used Tesseract for OCR, followed by a regex-based extraction layer. The approach worked for exactly one bank at a time. Here's what a typical parsing rule looked like:
// The kind of code we were writing before Claude Vision.
// One regex per bank. Breaks constantly.
const CHASE_TRANSACTION_PATTERN =
/^(\d{2}\/\d{2})\s+(.+?)\s{2,}(-?\$?[\d,]+\.\d{2})\s*$/gm;
const BOA_TRANSACTION_PATTERN =
/^(\d{2}\/\d{2}\/\d{2})\s+(\S.+?)\s+(-?[\d,]+\.\d{2})\s+([\d,]+\.\d{2})$/gm;
// Then the special cases started:
// - Multi-line descriptions? Different regex.
// - Pending transactions in italics? Tesseract misreads them.
// - Scanned (non-digital) PDFs? Completely different pipeline.
// - Spanish-language statements from BBVA? Another set of rules.

The regex approach has a fundamental scaling problem. Each new bank requires reverse-engineering its exact format, writing custom parsing rules, and maintaining those rules as the bank updates its statement layout. We were maintaining over 40 bank-specific regex patterns, and the long tail of smaller banks was essentially unsupported.
| | OCR + Regex Pipeline | Claude Vision Pipeline |
|---|---|---|
| New bank support | Days of reverse engineering | Works on first upload |
| Format changes | Breaks silently | Self-adapting |
| Multi-line descriptions | Complex lookahead regex | Handled natively |
| Scanned documents | Separate OCR pipeline | Same pipeline |
| Maintenance burden | Grows linearly with banks | Near-zero marginal cost |
| Per-page cost | ~$0.001 (compute only) | ~$0.01-0.03 (API cost) |
| Accuracy (digital PDFs) | ~88% across banks | ~97% across banks |
The cost difference is real, but the accuracy and maintenance tradeoffs made the decision clear. We were spending more on engineering time maintaining regex patterns than we'd ever spend on API calls.
Architecture Overview
The pipeline is simpler than what it replaced. A PDF comes in, we convert it to page images, send each page to Claude Vision for extraction, then validate and reconcile the output.
PDF Upload
|
v
[pdf-to-image] ---- Convert each page to PNG at 300 DPI
|
v
[page-classifier] -- Determine page type (transactions, summary, disclosures)
|
v
[claude-vision] ---- Extract structured data from transaction pages
|
v
[validator] -------- Cross-check totals, detect missing transactions
|
v
[categorizer] ------ Classify merchants via second AI pass
|
v
[output-formatter] - Generate CSV/Excel/JSON output

PDF to Image Conversion
We use pdf-lib for initial PDF analysis (page counts and metadata) and pdf2pic for rendering each page to PNG. The key decision was rendering at 300 DPI. At 150 DPI, Claude occasionally misreads digits in amounts ($1,234.56 becomes $1,234.86). At 600 DPI, the images are large enough to hit API payload limits, and the accuracy improvement is negligible.
import { fromPath } from "pdf2pic";
interface PageImage {
pageNumber: number;
buffer: Buffer;
width: number;
height: number;
}
async function convertPdfToImages(pdfPath: string): Promise<PageImage[]> {
const converter = fromPath(pdfPath, {
density: 300,
format: "png",
width: 2550, // 8.5" at 300 DPI
height: 3300, // 11" at 300 DPI
});
const pageCount = await getPdfPageCount(pdfPath);
const images: PageImage[] = [];
for (let i = 1; i <= pageCount; i++) {
const result = await converter(i, { responseType: "buffer" });
images.push({
pageNumber: i,
buffer: result.buffer,
width: result.width,
height: result.height,
});
}
return images;
}

Prompt Engineering for Transaction Extraction
This is where we spent the most iteration time. The prompt has to do several things simultaneously: identify the transaction table, parse each row correctly, handle the bank's specific formatting, and output clean structured JSON. We went through roughly 30 prompt versions before settling on our current approach.
The System Prompt
Our system prompt establishes the extraction contract. We found that being extremely specific about the output schema reduced hallucinations significantly.
const EXTRACTION_SYSTEM_PROMPT = `You are a bank statement data extraction system.
Your job is to extract every transaction from the provided bank statement page image.
CRITICAL RULES:
1. Extract ONLY transactions that appear on this page. Do not infer or fabricate transactions.
2. If a transaction description spans multiple lines, concatenate them into a single description.
3. Dates should be normalized to YYYY-MM-DD format. Use the statement year from the header.
4. Amounts must be exact. Never round. Include the sign: negative for debits, positive for credits.
5. If you cannot confidently read a value, set "confidence": "low" for that transaction.
6. Ignore summary rows, subtotals, and balance-forward entries.
Output ONLY valid JSON matching this schema:
{
"transactions": [
{
"date": "YYYY-MM-DD",
"description": "string",
"amount": number,
"balance": number | null,
"type": "debit" | "credit",
"confidence": "high" | "low"
}
],
"pageMetadata": {
"statementPeriod": "string or null",
"accountLastFour": "string or null",
"pageNumber": number,
"totalPages": number | null
}
}`;

The Extraction Call
Each page gets its own API call. We considered batching multiple pages into a single request, but found that accuracy dropped noticeably when Claude had to process more than two pages at once. The context window isn't the issue; it's that the model's attention to fine-grained numerical details degrades as the image count increases.
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic();
interface ExtractionResult {
transactions: Transaction[];
pageMetadata: PageMetadata;
}
async function extractTransactionsFromPage(
pageImage: PageImage,
statementContext?: { year: number; bankName?: string }
): Promise<ExtractionResult> {
const base64Image = pageImage.buffer.toString("base64");
const userPrompt = statementContext?.bankName
? `Extract all transactions from this ${statementContext.bankName} bank statement page.
The statement year is ${statementContext.year}.`
: "Extract all transactions from this bank statement page.";
const response = await anthropic.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 4096,
system: EXTRACTION_SYSTEM_PROMPT,
messages: [
{
role: "user",
content: [
{
type: "image",
source: {
type: "base64",
media_type: "image/png",
data: base64Image,
},
},
{ type: "text", text: userPrompt },
],
},
],
});
const text = response.content[0].type === "text" ? response.content[0].text : "";
return JSON.parse(text) as ExtractionResult;
}

Prompt Iteration Lesson
Early versions of our prompt used phrases like "try to extract" and "do your best." Changing to imperative language ("Extract ONLY transactions," "Amounts must be exact") measurably improved accuracy. The model responds well to precision in instructions.
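A related defensive detail: models occasionally wrap JSON output in a markdown code fence despite an "Output ONLY valid JSON" instruction, which makes a bare JSON.parse fail intermittently. A small parse helper along these lines (a sketch with illustrative names, not our production code) absorbs that failure mode:

```typescript
// Models occasionally wrap JSON in a markdown code fence; strip it
// before parsing. FENCE is built at runtime to avoid embedding a
// literal fence in this listing.
const FENCE = "`".repeat(3);
const FENCED_JSON = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

function stripCodeFences(text: string): string {
  const match = text.match(FENCED_JSON);
  return (match ? match[1] : text).trim();
}

function safeParseJson<T>(text: string): T {
  return JSON.parse(stripCodeFences(text)) as T;
}
```

Swapping this in for the raw JSON.parse calls costs nothing and removes a whole class of flaky extraction failures.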
Handling Edge Cases
The elegance of the vision-based approach is that many edge cases that required explicit handling in the regex pipeline just work. But some required real engineering.
Multi-Line Descriptions
Some banks split transaction descriptions across two or three lines. Chase does this frequently with card purchases, showing the merchant name on one line and the city/state on the next. Claude handles this natively because it sees the visual layout, but we needed to ensure it concatenated properly rather than creating duplicate entries.
Our prompt rule #2 ("If a transaction description spans multiple lines, concatenate them into a single description") handles most cases. For the remaining few, we added a post-processing step that detects suspiciously incomplete transactions (entries with a description but no amount) and merges them with the preceding transaction.
Page Break Transactions
This was our hardest edge case. Sometimes a transaction starts at the bottom of one page and continues at the top of the next. The date and description appear on page N, but the amount appears on page N+1. Claude extracts a partial transaction from each page.
function reconcilePageBreaks(pages: ExtractionResult[]): Transaction[] {
const allTransactions: Transaction[] = [];
for (let i = 0; i < pages.length; i++) {
const currentPage = pages[i].transactions;
const nextPage = pages[i + 1]?.transactions;
for (const tx of currentPage) {
// Detect partial transaction at page boundary:
// has a date and description, but amount is 0 or null
if (!tx.amount && nextPage && nextPage.length > 0) {
const continuation = nextPage[0];
// If next page starts with a transaction that has no date,
// it's likely a continuation
if (!continuation.date || continuation.date === "") {
tx.amount = continuation.amount;
tx.balance = continuation.balance;
tx.description += " " + continuation.description;
nextPage.shift(); // remove the merged entry
}
}
allTransactions.push(tx);
}
}
return allTransactions;
}

Running Balance Validation
Some banks include a running balance column. This is incredibly useful for validation. If transaction N has a balance of $5,000 and transaction N+1 is a debit of $200, we expect transaction N+1's balance to be $4,800. When these don't match, we know something was misread.
interface ValidationResult {
isValid: boolean;
errors: ValidationError[];
confidence: number;
requiresReExtraction?: boolean;
}
function validateRunningBalances(transactions: Transaction[]): ValidationResult {
const errors: ValidationError[] = [];
for (let i = 1; i < transactions.length; i++) {
const prev = transactions[i - 1];
const curr = transactions[i];
if (prev.balance !== null && curr.balance !== null) {
const expectedBalance = prev.balance + curr.amount;
const diff = Math.abs(expectedBalance - curr.balance);
if (diff > 0.01) { // floating point tolerance
errors.push({
type: "balance_mismatch",
transactionIndex: i,
expected: expectedBalance,
actual: curr.balance,
diff,
});
}
}
}
// An empty page has no balance pairs to compare, so report full
// confidence rather than dividing by zero.
const confidence =
transactions.length > 0 ? 1 - errors.length / transactions.length : 1;
return { isValid: errors.length === 0, errors, confidence };
}

The Validation Pipeline
Extraction without validation is guessing. Our validation pipeline runs four independent checks and flags any statement where confidence drops below our threshold.
- Running balance check: If the statement includes balances, verify each transaction's balance follows from the previous one. Catches misread digits.
- Statement total reconciliation: Most statements include a summary section with total debits, total credits, and ending balance. We extract these from the summary page and compare against our calculated totals. A mismatch of more than $0.01 triggers a re-extraction with a higher-resolution image.
- Transaction count heuristic: We estimate the expected number of transactions from the physical space on each page (roughly 25-35 per full page for most banks). If we extract significantly fewer, a transaction may have been skipped.
- Duplicate detection: Identical date + amount + description combinations are flagged. Genuine duplicates exist (two $5.00 Starbucks charges on the same day) but are rare enough to warrant flagging.
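Of the four, the duplicate check is the simplest to sketch. This simplified version returns flagged indices; the production check wraps them in the same ValidationResult shape as the other three (names here are illustrative):

```typescript
interface TxKeyFields {
  date: string;
  description: string;
  amount: number;
}

// Flag exact date + amount + description repeats. Flagged entries are
// surfaced for review, never dropped, since genuine duplicates exist.
function findDuplicateIndices(txs: TxKeyFields[]): number[] {
  const seen = new Set<string>();
  const flagged: number[] = [];
  txs.forEach((tx, i) => {
    const key = `${tx.date}|${tx.amount}|${tx.description}`;
    if (seen.has(key)) flagged.push(i);
    else seen.add(key);
  });
  return flagged;
}
```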
async function validateExtraction(
transactions: Transaction[],
summaryData: StatementSummary | null
): Promise<ValidationResult> {
const checks = await Promise.all([
validateRunningBalances(transactions),
validateAgainstSummary(transactions, summaryData),
validateTransactionDensity(transactions),
detectDuplicates(transactions),
]);
const allErrors = checks.flatMap((c) => c.errors);
const overallConfidence = checks.reduce((acc, c) => acc * c.confidence, 1);
// If confidence is below threshold, retry with enhanced extraction
if (overallConfidence < 0.85) {
return {
isValid: false,
errors: allErrors,
confidence: overallConfidence,
requiresReExtraction: true,
};
}
return {
isValid: allErrors.length === 0,
errors: allErrors,
confidence: overallConfidence,
requiresReExtraction: false,
};
}

Re-extraction Strategy
When validation fails, we re-extract the problematic pages at 600 DPI with an augmented prompt that includes the specific error context (e.g., "The running balance after transaction 14 should be approximately $3,241.50. Please re-verify all transactions on this page."). This targeted retry resolves about 70% of extraction errors without human intervention.
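Assembling that augmented prompt from the validator's error records looks roughly like this (a sketch; the field names are illustrative and the production prompt carries more context):

```typescript
interface BalanceMismatch {
  transactionIndex: number;
  expected: number;
}

// Fold validator output into a targeted re-extraction prompt so the
// model knows exactly which values to double-check.
function buildRetryPrompt(errors: BalanceMismatch[]): string {
  const hints = errors.map(
    (e) =>
      `The running balance after transaction ${e.transactionIndex} should ` +
      `be approximately $${e.expected.toFixed(2)}.`
  );
  return [
    "Extract all transactions from this bank statement page.",
    "A previous pass produced these inconsistencies:",
    ...hints,
    "Please re-verify every date and amount on this page.",
  ].join("\n");
}
```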
Performance and Cost Optimization
A typical bank statement has 3-8 pages. At Anthropic's current pricing, each page costs roughly $0.01-0.03 to process depending on image size and output length. For a 5-page statement, total extraction cost is about $0.08-0.15. That's viable for a paid product, but we still optimized aggressively.
Page Classification
Not every page in a statement contains transactions. The first page is usually an account summary. The last few pages are often disclosures and legal text. Sending these to the full extraction pipeline wastes money and can introduce noise (the model might try to extract table-like data from fee schedules).
We added a lightweight classification step that uses Claude Haiku to categorize each page before extraction. Haiku is fast and cheap enough that classifying all pages costs less than running the full extraction on a single non-transaction page.
type PageType = "transactions" | "summary" | "disclosures" | "other";
async function classifyPage(pageImage: PageImage): Promise<PageType> {
const response = await anthropic.messages.create({
model: "claude-3-5-haiku-20241022",
max_tokens: 50,
messages: [
{
role: "user",
content: [
{
type: "image",
source: {
type: "base64",
media_type: "image/png",
data: pageImage.buffer.toString("base64"),
},
},
{
type: "text",
text: "Classify this bank statement page. Reply with ONLY one of: transactions, summary, disclosures, other",
},
],
},
],
});
const text = response.content[0].type === "text"
? response.content[0].text.trim().toLowerCase()
: "other";
return text as PageType;
}Parallel Processing and Caching
Page extraction calls are independent and can run in parallel. We use Promise.allSettled with a concurrency limiter to process up to 5 pages simultaneously without hitting rate limits. For repeat uploads of the same statement (common during testing and debugging), we cache results keyed on a hash of the page image buffer.
import pLimit from "p-limit";
import { createHash } from "crypto";
const limit = pLimit(5); // max 5 concurrent API calls
function getPageHash(buffer: Buffer): string {
return createHash("sha256").update(buffer).digest("hex");
}
async function extractAllPages(
pages: PageImage[],
cache: Map<string, ExtractionResult>
): Promise<ExtractionResult[]> {
const tasks = pages.map((page) =>
limit(async () => {
const hash = getPageHash(page.buffer);
if (cache.has(hash)) return cache.get(hash)!;
const result = await extractTransactionsFromPage(page);
cache.set(hash, result);
return result;
})
);
const results = await Promise.allSettled(tasks);
return results
.filter((r): r is PromiseFulfilledResult<ExtractionResult> =>
r.status === "fulfilled"
)
.map((r) => r.value);
}

The Categorization Layer
Raw transaction descriptions from bank statements are cryptic. "CHECKCARD 0215 WHOLEFDS MKT #10847" means a Whole Foods purchase. "SQ *BLUE BOTTLE COF" is a Square payment to Blue Bottle Coffee. Users need clean merchant names and spending categories.
We use a separate AI pass for categorization. This runs after extraction and validation, processing all transactions in a single batch call rather than one per transaction. This is an important architectural decision: keeping extraction and categorization separate means a categorization error never corrupts the raw financial data.
interface CategorizedTransaction extends Transaction {
merchantName: string; // "Whole Foods Market"
category: string; // "Groceries"
subcategory: string; // "Supermarket"
}
async function categorizeTransactions(
transactions: Transaction[]
): Promise<CategorizedTransaction[]> {
// Batch all descriptions into a single call
const descriptions = transactions.map((tx, i) => `${i}: ${tx.description}`);
const response = await anthropic.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 4096,
system: `You are a transaction categorization system.
For each numbered transaction description, provide:
- merchantName: The clean, human-readable merchant name
- category: One of [Groceries, Dining, Transport, Shopping, Bills, Entertainment, Health, Income, Transfer, Other]
- subcategory: A more specific label
Output JSON array matching the input ordering.`,
messages: [
{
role: "user",
content: `Categorize these transactions:\n${descriptions.join("\n")}`,
},
],
});
const text = response.content[0].type === "text" ? response.content[0].text : "[]";
const categories = JSON.parse(text) as Array<{
merchantName: string;
category: string;
subcategory: string;
}>;
return transactions.map((tx, i) => ({
...tx,
merchantName: categories[i]?.merchantName ?? tx.description,
category: categories[i]?.category ?? "Other",
subcategory: categories[i]?.subcategory ?? "Unknown",
}));
}

Categorization accuracy sits around 90% for common merchants and drops for obscure or highly abbreviated descriptions. We've considered fine-tuning or building a lookup table for the top 1,000 merchants, but the diminishing returns haven't justified the investment yet.
Results
We've processed over 50,000 bank statement pages through this pipeline. Here's where accuracy stands across different bank types:
| Bank Type | Transaction Accuracy | Amount Accuracy | Notes |
|---|---|---|---|
| Major US banks (Chase, BofA, Wells Fargo) | 97.2% | 99.1% | Best results due to clean digital PDFs |
| Regional/community banks | 95.4% | 98.3% | Slightly more format variation |
| Credit unions | 93.8% | 97.6% | Older PDF generators, less consistent layouts |
| International banks | 91.2% | 96.8% | Multi-currency and language challenges |
| Scanned/photographed statements | 89.5% | 94.2% | Image quality is the limiting factor |
"Transaction accuracy" measures whether we correctly identified every transaction on the page. "Amount accuracy" measures whether the dollar amounts are exactly right. Amount accuracy is higher because even when we miss a transaction entirely, the ones we do extract tend to have correct amounts.
The most common failure mode is not misreading a number. It's skipping a transaction entirely, usually because it's in an unusual position on the page (a fee listed outside the main transaction table, or a correction entry in a footnote). The second most common failure is misattributing a transaction's sign: interpreting a credit as a debit or vice versa, especially on statements where the sign convention is indicated by column position rather than explicit +/- symbols.
What We'd Do Differently
If we were starting this project today, with the benefit of everything we've learned, here's what we'd change:
- Start with vision from day one. We spent three months on the OCR pipeline before pivoting. The regex approach felt more "engineered" and controllable, which was seductive but wrong. Vision-first would have saved us a quarter.
- Build the validation pipeline first. We built extraction first and validation as an afterthought. In hindsight, having robust validation early would have made prompt iteration much faster. You can't improve what you can't measure.
- Use structured output from the start. We initially parsed JSON from Claude's free-text responses using string manipulation. Switching to asking for pure JSON output with a strict schema reduced parsing errors to near zero. Anthropic's tool-use API with JSON schema validation would be our starting point today.
- Invest more in the page classifier. Our simple Haiku-based classifier works, but a purpose-built model would be faster and cheaper. Page classification is a solved problem and doesn't need an LLM.
- Don't over-optimize on cost too early. We spent weeks trying to reduce API calls before we had product-market fit. The cost of Claude API calls is trivial compared to the engineering time spent optimizing. Get accuracy right first, then optimize.
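The tool-use suggestion in point 3 can be sketched as follows. The tool name `record_transactions` and the trimmed schema are illustrative; the idea is passing the schema via `tools` and forcing the call with `tool_choice`, so the response arrives as structured input instead of free text to parse:

```typescript
// A trimmed tool definition (illustrative schema, not production).
const extractionTool = {
  name: "record_transactions",
  description: "Record transactions extracted from a statement page.",
  input_schema: {
    type: "object" as const,
    properties: {
      transactions: {
        type: "array",
        items: {
          type: "object",
          properties: {
            date: { type: "string" },
            description: { type: "string" },
            amount: { type: "number" },
          },
          required: ["date", "description", "amount"],
        },
      },
    },
    required: ["transactions"],
  },
};

// Pull the structured input back out of the response content blocks.
function getToolInput<T>(
  content: Array<{ type: string; input?: unknown }>
): T | null {
  const block = content.find((b) => b.type === "tool_use");
  return block ? (block.input as T) : null;
}
```

The extraction call then passes `tools: [extractionTool]` and `tool_choice: { type: "tool", name: "record_transactions" }` to `messages.create`, and reads the result with `getToolInput` instead of `JSON.parse`.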
A Note on Accuracy Claims
We're reporting our real numbers, not cherry-picked benchmarks. Your results will vary depending on the quality of input PDFs, the banks you're processing, and how you define "accuracy." We measure strictly: a transaction is either perfectly extracted or it's wrong. Some competitors report fuzzy accuracy metrics where close-enough amounts count as correct. Be skeptical of any solution claiming 99%+ accuracy across all banks.
Try It Yourself
We've built this pipeline into StatementVision, so you don't have to build it yourself. Upload a bank statement PDF and get structured, categorized transaction data in CSV or Excel format in seconds. The first three statements are free, no account required.
If you're a developer building something similar, we hope this writeup saves you some of the trial and error we went through. The core insight is simple: let the AI see the document the way you do, validate aggressively, and don't fight the format wars with regex.
See the parser in action. Upload any bank statement and get clean, structured data.
Try StatementVision Free