AI & ML · 8 min read · 12 Mar 2026

How to Build a Fraud-Resilient Bank Statement Analysis Pipeline

Statement fraud in India has evolved from crude edits in Notepad to professionally fabricated PDFs. Here's what a modern detection architecture actually needs to catch it reliably.

Subhalaxmi Das

Co-founder & CTO · Santulan


Bank statement fraud in India has followed the predictable trajectory of all financial fraud: as detection methods have improved, so have the fraud techniques. What began as amateur edits in Notepad or Word has evolved into a cottage industry of professionally produced fabricated statements, complete with correct fonts, running balances that hold up on a first reading, and transaction descriptions that pass visual inspection by most analysts.

The Taxonomy of Statement Fraud

Understanding what you're defending against is the starting point for building an effective detection system. Bank statement fraud in India roughly divides into four categories:

Category 1 — Balance inflation. The most common and least sophisticated type. Existing authentic transactions are kept, but opening and closing balances are inflated. Running balance calculations typically break down on close inspection.

Category 2 — Transaction editing. Specific transactions are modified — a ₹22,000 salary credit becomes ₹85,000, for example. This is more work to execute and harder to detect at a surface level, but creates mathematical inconsistencies in the running balance.

Category 3 — Transaction insertion. New transactions are added to an otherwise authentic statement. The challenge for the fraudster is maintaining consistent running balances across the inserted transactions — which requires either sophisticated editing or accepting mathematical errors they hope won't be noticed.

Category 4 — Wholesale fabrication. Fully fabricated PDFs, sometimes sold as a service, complete with a specific bank's header, footer, fonts, and formatting. These range from poor quality (caught by basic format checks) to professionally produced (requiring deep technical analysis to detect).


Layer 1: Document Authenticity Checks

The first detection layer operates on the document itself before looking at any transaction data. PDF metadata analysis can reveal editing software fingerprints — a statement 'generated by HDFC NetBanking' but containing Creator fields from Adobe Acrobat or LibreOffice Writer is worth flagging. Font consistency checks can identify text that was typed into an existing document rather than generated by the bank's PDF engine. Digital signatures, where present, should be validated against known bank certificate authorities.
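As a rough illustration of the metadata check, the sketch below assumes the PDF's Creator and Producer fields have already been read into a plain dictionary by whatever PDF library the pipeline uses. The function name `metadata_flags` and the `EDITOR_FINGERPRINTS` list are hypothetical and deliberately incomplete; a real deployment would maintain per-bank lists of expected generators.

```python
# Hypothetical, non-exhaustive substrings that suggest a document was
# opened or produced in general-purpose editing software.
EDITOR_FINGERPRINTS = ("acrobat", "libreoffice", "microsoft word")


def metadata_flags(metadata: dict) -> list:
    """Return one flag per metadata field that mentions editing software.

    `metadata` is assumed to map PDF Info field names ("Creator",
    "Producer") to their string values.
    """
    flags = []
    for field in ("Creator", "Producer"):
        value = metadata.get(field, "")
        lowered = value.lower()
        if any(fingerprint in lowered for fingerprint in EDITOR_FINGERPRINTS):
            flags.append(f"{field} mentions editing software: {value!r}")
    return flags


if __name__ == "__main__":
    suspicious = {"Creator": "Adobe Acrobat Pro 11.0", "Producer": "BankPDFEngine"}
    for flag in metadata_flags(suspicious):
        print(flag)
```

A match here is a reason to look closer, not proof of tampering: a customer may have re-saved a genuine statement in a desktop tool.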

Layer 2: Mathematical Consistency

Every bank statement has a mathematically deterministic structure: Opening Balance + (Sum of Credits) - (Sum of Debits) = Closing Balance. Provided the transactions were extracted correctly, any deviation from this identity is a definitive fraud signal, not merely a probable one. Running balance checks (does each transaction's running balance equal the previous balance plus the credit or minus the debit?) add another layer of mathematical validation.

These checks sound obvious, but a surprising proportion of fraudulent statements circulate with mathematical errors that would be caught by even a basic implementation. The fraudster assumed no one was checking.
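A minimal sketch of both checks, assuming each transaction has already been extracted as a (credit, debit, running balance) tuple in statement order. The function name `check_statement` and the row layout are illustrative assumptions, and `Decimal` is used so that rupee amounts compare exactly.

```python
from decimal import Decimal


def check_statement(opening, closing, rows):
    """Validate the closing-balance identity and every running balance.

    rows: list of (credit, debit, running_balance) Decimals in statement
    order. Returns a list of findings; an empty list means both checks pass.
    """
    findings = []

    # Identity: opening + sum(credits) - sum(debits) == closing
    total_credits = sum((row[0] for row in rows), Decimal("0"))
    total_debits = sum((row[1] for row in rows), Decimal("0"))
    if opening + total_credits - total_debits != closing:
        findings.append("closing balance does not match opening + credits - debits")

    # Row-by-row running balance check
    previous = opening
    for index, (credit, debit, balance) in enumerate(rows, start=1):
        expected = previous + credit - debit
        if balance != expected:
            findings.append(f"row {index}: balance {balance}, expected {expected}")
        # Continue from the stated balance so a single edit yields a single finding.
        previous = balance
    return findings


if __name__ == "__main__":
    # A 22,000 salary credit edited to 85,000 while the balances were left untouched.
    tampered = [
        (Decimal("85000"), Decimal("0"), Decimal("32000")),
        (Decimal("0"), Decimal("5000"), Decimal("27000")),
    ]
    for finding in check_statement(Decimal("10000"), Decimal("27000"), tampered):
        print(finding)
```

Reporting the row number along with the expected balance points the analyst to where the arithmetic first breaks, which is usually where the edit was made.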

Layer 3: Statistical Behavioural Analysis

The most powerful fraud detection operates not on document properties or mathematical identities, but on statistical patterns across transaction data. Genuine bank accounts have characteristic statistical signatures: the leading digits of transaction amounts tend to follow Benford's Law; the timing of transactions reflects actual banking behaviour; round-number transactions appear in characteristic proportions.

Fabricated transaction sequences frequently violate these statistical norms in detectable ways — too many round numbers, unusually uniform transaction spacing, amount distributions that don't match the claimed account type. These are signals that require quantitative analysis but are highly reliable when present.
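Two of these signals can be sketched in a few lines. The example below assumes amounts are held as positive integers in paise, and computes a chi-square statistic of the leading digits against Benford's Law plus the share of round-number amounts. The function names and the ₹1,000 rounding step are assumptions for illustration, not calibrated values.

```python
import math
from collections import Counter

# Benford's Law: P(leading digit = d) = log10(1 + 1/d) for d = 1..9
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def benford_chi_square(amounts_paise):
    """Chi-square statistic of observed leading digits against Benford's Law."""
    digits = [int(str(amount)[0]) for amount in amounts_paise if amount > 0]
    n = len(digits)
    if n == 0:
        raise ValueError("no positive amounts to analyse")
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )


def round_number_share(amounts_paise, step_paise=100_000):
    """Fraction of positive amounts that are exact multiples of the step.

    The default step of 100,000 paise (Rs 1,000) is an illustrative choice.
    """
    positive = [amount for amount in amounts_paise if amount > 0]
    if not positive:
        raise ValueError("no positive amounts to analyse")
    return sum(1 for amount in positive if amount % step_paise == 0) / len(positive)


if __name__ == "__main__":
    amounts = [500_000] * 50  # fifty identical Rs 5,000 transactions
    print(benford_chi_square(amounts), round_number_share(amounts))
```

With nine digit categories the statistic has 8 degrees of freedom, for which the 5% critical value is about 15.5. The test is unreliable on small samples and on accounts whose amounts span a narrow range, so thresholds would need tuning on real portfolios.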

Layer 4: Cross-Source Corroboration

The final and most robust detection layer is cross-source verification: using other available data to corroborate or challenge what the statement shows. GST turnover reconciliation, Form 26AS income data, ITR-declared income, and bureau-reported EMIs all provide external reference points. A statement showing ₹1.5 Cr of annual income that doesn't appear in any Form 26AS entry, isn't supported by any GST turnover, and doesn't match any bureau-reported employer warrants significant scrutiny.
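One simple way to express this reconciliation is to compare the income implied by the statement with each external figure and report the sources that fall well short of it. The sketch below is a deliberate simplification: the function name `corroboration_gaps`, the source labels, and the 25% tolerance are all hypothetical, and in practice figures such as GST turnover are not directly comparable to income without adjustment.

```python
from decimal import Decimal


def corroboration_gaps(statement_income, external, tolerance=Decimal("0.25")):
    """Return external sources that fall short of the statement's income.

    external: mapping of source name -> annual figure (Decimal), or None
    when that source could not be fetched at all. A figure of zero means
    the source was fetched and reported nothing.
    The result maps each short source to its shortfall as a fraction of
    statement_income.
    """
    if statement_income <= 0:
        raise ValueError("statement_income must be positive")
    gaps = {}
    for source, figure in external.items():
        if figure is None:
            continue  # no data is not the same as contradicting data
        shortfall = (statement_income - figure) / statement_income
        if shortfall > tolerance:
            gaps[source] = shortfall
    return gaps


if __name__ == "__main__":
    external = {
        "form_26as": Decimal("1800000"),
        "itr_declared": Decimal("2000000"),
        "gst_turnover": None,
    }
    # Statement implies Rs 1.5 Cr of annual income.
    print(corroboration_gaps(Decimal("15000000"), external))
```

Keeping "not available" separate from "reported zero" matters because a source that was fetched and shows nothing is itself a meaningful signal.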

No single layer catches all fraud. A well-designed pipeline combines all four — document-level, mathematical, statistical, and cross-source — and presents findings to analysts as risk-weighted signals rather than binary pass/fail, enabling better risk-adjusted decisions rather than simple rejection of everything flagged.
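To make the idea of risk-weighted signals concrete, here is one possible way to combine findings from the four layers into a single score while keeping the individual signals visible to the analyst. The `Signal` class, the weights, and the combination rule (treating each weight as an independent probability of fraud) are assumptions for illustration, not a recommended calibration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    layer: str     # "document", "mathematical", "statistical" or "cross_source"
    detail: str    # human-readable description for the analyst
    weight: float  # hypothetical severity between 0 and 1


def risk_score(signals):
    """Combine weights as 1 - product(1 - weight).

    Each extra signal raises the score, and the score stays within [0, 1].
    """
    remaining = 1.0
    for signal in signals:
        remaining *= 1.0 - signal.weight
    return 1.0 - remaining


def ranked(signals):
    """Signals ordered from most to least severe, for display to an analyst."""
    return sorted(signals, key=lambda signal: signal.weight, reverse=True)


if __name__ == "__main__":
    findings = [
        Signal("document", "Creator field mentions editing software", 0.5),
        Signal("statistical", "High share of round-number credits", 0.5),
    ]
    print(risk_score(findings))
    for signal in ranked(findings):
        print(signal.layer, signal.detail)
```

Showing the ranked signals alongside the score lets the analyst see why a file was flagged instead of receiving a bare accept or reject.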
