Does this also extract semantic relationships and data dependencies between fields?
In the past I'd built an internal tool that transforms insurance PDFs to structured data. I wanted to extract explicit data dependencies between fields to perform validation.
Insurance forms can sometimes have 30-40 pages and they can have fields on page 40 that depend on fields on page 4 with a few nested if conditions. Would Parsewise be able to extract those relationships?
If yes, how do you do it for large documents?
gergelycsegzi 11 hours ago [-]
Yes, we do it with a multi-stage pipeline. First we extract the independent data points (say, from both page 4 and page 40), and a second pass establishes the relationships between them (we call this resolution).
On the scale aspect, because we go in multiple passes, we break the scope into small enough pieces and then build it back up in a later step. Iirc the largest document I've seen a customer use was over 1k pages.
There are more complex data dependency scenarios where we find that the data that's extracted and combined (e.g. from pages 4 and 40) then needs to be further transformed in different ways (e.g. producing both an evaluation and a clarification outcome at the end). To keep these outputs consistent with each other, we are soon releasing a feature we call derived agents.
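A minimal sketch of the two-pass idea described above, assuming a first pass that yields per-page field values and a second pass that checks cross-page dependencies. The names `DataPoint`, `Rule`, `extract_pass` and `resolve_pass` are hypothetical and are not Parsewise's API; the `pages` dict stands in for whatever an LLM or OCR step would return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    name: str      # field name, e.g. "has_prior_claims"
    value: object  # extracted value
    page: int      # page the value came from, kept for traceability


@dataclass(frozen=True)
class Rule:
    target: str            # field whose presence is conditional
    depends_on: str        # field that controls the condition
    required_when: object  # value of depends_on that makes target mandatory


def extract_pass(pages):
    """Pass 1: collect data points page by page, with no cross-page logic."""
    points = {}
    for page_no, fields in pages.items():
        for name, value in fields.items():
            points[name] = DataPoint(name, value, page_no)
    return points


def resolve_pass(points, rules):
    """Pass 2: check each dependency against the extracted data points."""
    issues = []
    for rule in rules:
        source = points.get(rule.depends_on)
        target = points.get(rule.target)
        if source is None:
            issues.append(f"{rule.depends_on}: controlling field not found")
        elif source.value == rule.required_when and (
            target is None or target.value in (None, "")
        ):
            issues.append(
                f"{rule.target}: required because {rule.depends_on}="
                f"{source.value!r} on page {source.page}"
            )
    return issues


# Hypothetical per-page extraction output for a long insurance form.
pages = {
    4: {"has_prior_claims": True},
    40: {"prior_claim_details": ""},
}
rules = [Rule("prior_claim_details", "has_prior_claims", True)]

points = extract_pass(pages)
issues = resolve_pass(points, rules)
print(issues)
```

Because the first pass never looks across pages, it can be split into small independent units; only the second pass needs the combined view.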
nilirl 6 hours ago [-]
1. Incredible! Can I make an unsolicited ask? If you had industry specific templates for standardized PDFs it would be easier for me to send Parsewise to the insurance companies I'd worked for. Something similar to https://www.useanvil.com/forms/?type=pdf-templates but with your clean, semantic data model.
2. Can I ask how? When I was building something like this, I realized there's an element of burning tokens for correctness. Meaning, splitting things into small units and small processes, each using a separate LLM output to be later combined. For a 1k page document, what kind of token usage do you see?
gergelycsegzi 5 hours ago [-]
Re 1 - that is a very kind offer! Our current public template library is very limited, so let me come back to you on this.
2. We see exactly the same thing. There is a trade-off in correctness vs token burning. However, some tokens (models) are cheaper and faster than others, so the small pieces can benefit from that. The token usage is also surprisingly variable, because it depends on the information density of the document and also on the information density of the question (e.g. is it a single needle in a haystack or are we analyzing the entire haystack from 10 perspectives). So the parsing for 1k pages may be on the order of millions of tokens, while a series of queries (extractions) on top of it could be 1-2 orders of magnitude more.
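As a rough back-of-the-envelope illustration of that last sentence: the tokens-per-page figure below is an assumption chosen only so that parsing 1k pages lands in the "millions of tokens" range, and the multipliers 10 and 100 stand for "1-2 orders of magnitude". None of these are measured numbers.

```python
def estimate_tokens(pages, tokens_per_page, query_multiplier):
    """Return (parse_tokens, query_tokens) for a simple linear cost model."""
    parse_tokens = pages * tokens_per_page
    query_tokens = parse_tokens * query_multiplier
    return parse_tokens, query_tokens


PAGES = 1_000
TOKENS_PER_PAGE = 2_000  # assumed average, for illustration only

for multiplier in (10, 100):  # 1 and 2 orders of magnitude
    parse_tokens, query_tokens = estimate_tokens(PAGES, TOKENS_PER_PAGE, multiplier)
    print(f"parse: {parse_tokens:,} tokens, queries (x{multiplier}): {query_tokens:,} tokens")
```

Under these assumptions the queries dominate the bill, which is where routing the small pieces to cheaper models would matter most.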
whinvik 1 days ago [-]
Document parsing is top of mind for me lately because, in some of the areas we work on, the bottleneck is becoming the ability to query documents the same way one queries an API.
I keep thinking the most obvious analogue is that we need some way to represent documents the same way we can represent structured data in Parquet. Parquet allows easy range-based queries and there is so much tooling built around Arrow.
But for documents I keep hitting a wall trying to figure out what the right abstractions are. Parquet allows filterable metadata. But what would such metadata be for documents? Then there is the arbitrariness of chunking and vectorization.
If we could just do this in a 2-step process where every document to be processed can be represented in a Parquet-like data format, then I think we would at least have the semblance of a solution.
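To make the analogy concrete: Parquet stores min/max statistics per row group, so a reader can skip groups that cannot match a range filter. A sketch of the same idea applied to document chunks might look like the following. The `ChunkGroup` structure and the choice of a year range as metadata are assumptions for illustration; which metadata is actually useful depends on the use case.

```python
from dataclasses import dataclass


@dataclass
class ChunkGroup:
    doc_id: str
    page_min: int  # first page covered by this group
    page_max: int  # last page covered by this group
    year_min: int  # earliest year mentioned in the group
    year_max: int  # latest year mentioned in the group
    chunks: list   # the text chunks themselves


def prune(groups, year_lo, year_hi):
    """Keep only groups whose [year_min, year_max] overlaps the queried range."""
    return [g for g in groups if g.year_max >= year_lo and g.year_min <= year_hi]


groups = [
    ChunkGroup("bulletin-1950s", 1, 50, 1950, 1959, ["chunk text"]),
    ChunkGroup("bulletin-1960s", 1, 60, 1960, 1969, ["chunk text"]),
    ChunkGroup("bulletin-1970s", 1, 70, 1970, 1979, ["chunk text"]),
]

# Only the groups that can contain 1965-1972 data need to be read at all.
selected = prune(groups, 1965, 1972)
print([g.doc_id for g in selected])
```

The expensive step (reading or sending chunks to a model) then runs only on the surviving groups.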
gergelycsegzi 23 hours ago [-]
100%. The really hard challenge is that the intermediate representation (i.e. the Parquet equivalent) will depend on the given use case. So what we do with the platform is have the users configure the intermediate layer that serves most of their queries, and if they need to extend it we will suggest how. For example, for the demo on the grounded reasoning benchmark I referred to, here is what the intermediate layer looks like, on top of which the agents can query more efficiently: https://demo.parsewise.ai/projects/39bee9d8-d722-4b23-8894-e...
It’s much more limited in scope but fully open source and highly customisable. In fact it’s made for people to build their own pipelines on top of, providing the scaffolding needed to do so in a reliable way.
During development I’ve found it to be hard to truly generalise agent/llm-based data extraction, especially around the unlimited number of input types without task specific instructions (many files of the same kind, single large files, mixed kinds, bad quality files, docx/pdf/png/… the list goes on). Users sadly wanna upload all of these, and developers want a „one size fits all“ solution.
I am interested in how your solution deals with this. I came up with a strategy based approach so every task can be customised if needed, but I’d be delighted to see a technical writeup of how you deal with this endless variety of input + extraction task combos! :)
gergelycsegzi 24 hours ago [-]
I'll need to check it out!
We had the same observation: the possible space is almost endless, and even for the same file type different kinds of processing may be required (e.g. an Excel file can be database-style, small and narrative-heavy, or both).
We have baked in some ground processing rules for different kinds of documents, and we do allow custom instructions on how to deal with specific cases (e.g. translations, particular format layouts). The best write-up I have at the moment is https://www.parsewise.ai/doc-processing-pipelines but we're working on something that goes into more detail:)
chaitralikakde 1 days ago [-]
How portable are your agent definitions? If I build one for insurance documents, how much work is needed to adapt it to a completely different domain like legal contracts or healthcare?
gergelycsegzi 1 days ago [-]
In practice we find that each domain (and even each organisation) ends up having highly customized definitions.
At first, fairly generic templated definitions sort of work, but what we've seen is that over time data comes up that is out of distribution, and there was no explicit instruction on how to deal with it. In such cases we tend to flag this and offer suggestions to the users on how they can improve the specificity of agents.
Another structure we have seen play out is having a manager review ratings and feedback comments from their team and update the definitions accordingly over time (we offer them the capability to see before and after results side-by-side for all existing data as well, so they are more confident in the change before committing).
The amount of work depends on how good the initial definitions are and how complex the use case is (and how much it evolves - new data sources etc). A bit of an unsatisfying answer, but it can be anywhere between a few hours as a one-off and a couple of minutes per day on an ongoing basis.
sixdimensional 17 hours ago [-]
Might be interested in orthogonal reading - "The Textual Warehouse" (ISBN-10: 163462954X) by data warehouse pioneer Bill Inmon. He is and always has been ahead of his time with his thinking!
gergelycsegzi 11 hours ago [-]
This does indeed look really interesting. We have deterministic validations (and some deterministic excel transformations) but using more deterministic transformations for text based on traditional NLP would be a nice complement.
dennis16384 20 hours ago [-]
"With experience and support from" is a nice landing trick!
How do you extract facts from the documents and relate them to each other when that requires comprehension rather than simple similarity matching with common embedding models?
gergelycsegzi 20 hours ago [-]
Haha thanks, the reader can try and guess which is which;)
We actually don't use embeddings or vector similarity, since those tend not to work well in specialist domains (e.g. for the OfficeQA benchmark where we have 90k pages talking about US treasury numbers, they would be mostly mapped to a very small embedding space because it's all the same topic, with small variations across years, expense categories etc.).
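A toy illustration of that crowding effect. This uses bag-of-words cosine similarity as a crude stand-in for an embedding model (it is not a real embedding), and the sentences are made up. The point is only that statements differing in the one detail that matters, such as the year or the expense category, score as near-identical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


query = "total treasury expenditure on defense in fiscal year 1953"
wrong_year = "total treasury expenditure on defense in fiscal year 1954"
wrong_category = "total treasury expenditure on agriculture in fiscal year 1953"
unrelated = "the cat sat on the mat"

for text in (wrong_year, wrong_category, unrelated):
    print(f"{cosine(query, text):.2f}  {text}")
```

Both wrong answers score about 0.89 against the query while the unrelated sentence scores about 0.12, so similarity alone cannot separate the right passage from thousands of near-duplicates.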
We use LLMs for the extraction and comparison as well, and we route between different models depending on how much comprehension a given step requires (by this I mean routing between our pipeline steps; we currently do not dynamically try to judge individual cases for complexity like OpenRouter Fusion).
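Routing per pipeline step, as opposed to per individual case, can be as simple as a static lookup. In this sketch the step names and model labels are hypothetical placeholders, not Parsewise's actual configuration.

```python
# Static routing table: each pipeline step is pinned to a model tier up front.
ROUTES = {
    "page_extraction": "small-fast-model",  # many small, simple calls
    "resolution": "large-reasoning-model",  # cross-page relationships
    "comparison": "large-reasoning-model",  # judging extracted values
}
DEFAULT_MODEL = "large-reasoning-model"  # unknown steps fall back to the stronger tier


def model_for(step):
    """Pick the model for a step; the individual input is never inspected."""
    return ROUTES.get(step, DEFAULT_MODEL)


def plan(steps):
    """Return (step, model) pairs for a pipeline run."""
    return [(step, model_for(step)) for step in steps]


print(plan(["page_extraction", "resolution", "comparison"]))
```

A dynamic router would instead look at each input and estimate its difficulty, which is the part the comment says is not done today.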
vinaigrette 1 days ago [-]
This looks great for digital humanities, specifically archival work. Would love to try it.
gergelycsegzi 1 days ago [-]
Fully agree, that's why we quite like the Databricks OfficeQA benchmark.. it made us experts on historical US treasuries haha
Some screenshots in here: https://www.parsewise.ai/officeqa-sota
vinaigrette 1 days ago [-]
I'm surprised at the low rate every model manages considering the (apparent) ease of the benchmarked document. Can your pipeline produce ground truth as a byproduct? How do you think open-weight OCR models compare to the one showcased? I've had good results with glm-ocr on complex documents (complex because of their handwriting, with pretty easy layouts).
What I like about your solution is the traceability of the information. A scruffy pipeline I used was gemini-flash 3.0 to pdf to notebook-lm (really amateurish work i know), but it yielded tremendous time gains when extracting info from documents (that could be borderline impossible for me to read). However, tracing the info back was obviously very tedious. But from my experience, notebooklm can now manage ocr/htr without a third party. I wonder how competitive your solution might be compared to messy workflows that work -- albeit with effort -- but let the researcher be "in contact" with the material.
What I really want is obviously an easy to setup local rag system, with the (very) light model that goes with it ... sweet dream.
gergelycsegzi 24 hours ago [-]
We were also surprised at first. The reason the models don't do so well is that they need to find information across 90k pages. When they are pointed to the right location they tend to do much better. And with these treasury documents grepping / keyword searching is almost impossible because everything appears thousands of times.
And thank you, we also love the traceability, it's one of the aspects that we have prioritized. Models will never be perfect so rather than building the best model harness we went for the best human harness haha.
Tbh it's been a while since I've looked at notebooklm so I expect it will have gotten better over time. One thing where I found it lacking in the past was the structure we could get out (which gives the traceability) - for example a deep dive on one of the underlying data points for this corpus: https://demo.parsewise.ai/projects/39bee9d8-d722-4b23-8894-e...
And yes, we're really excited whenever new open weights models come out that push quality, price, latency. We're finding that throughput is a big obstacle so I'm looking forward to more of this running locally, but it will be a while..
vmandrade 22 hours ago [-]
Interesting product! Do you think it would work for e-discovery? I have around 120GB of emails, contracts, and the like, and I need to search for data and where certain expressions are referenced.
gergelycsegzi 22 hours ago [-]
Potentially, but at that scale cost and latency may actually become an issue, so probably better to consider some sort of indexing or keyword searching.
dennis16384 20 hours ago [-]
clickhouse?
rogerthis 21 hours ago [-]
I am seeing my client using things like this heavily (not exactly this). Also, what I would call "business awareness" is declining.
gergelycsegzi 21 hours ago [-]
I can see why, it's tempting to go for full automation. The reason we go for fine grained sourcing is so that people can build their awareness quickly. Plus many of our customers work in regulated industries where full automation is prohibited.
gorgmah 1 days ago [-]
I recently worked on an internal tool to achieve this kind of thing, mostly plugging mistral OCR into gemini to extract structured data from documents. We then perform automated diffs too.
There seems to be an insane amount of competition in the "Intelligent Document Processing" market, like for instance parseur, whose founder is often on HN himself.
What do you think sets you apart from competition like :
1) Mistral document AI : depending on the model, it looks way cheaper than yours, OCR model pricing ranges from 0.001 to 0.004 EUR / page and they have structured output wired in the OCR API if needed (things then get fed to one of their LLMs) + EU-based and GDPR ready
2) parseur / rossum / docsumo / nanonets (which is YC 2017) ?
joss82 1 days ago [-]
Hi, Parseur founder here :D
I understand what they are trying to do, but to me it feels like the moment when MongoDB entered the database space, with semi-structured, "flexible" storage format. It has its uses, for prototyping mostly.
But in high-volume, production workloads, giving a structure to the data you extract (what Parseur does through defining the Fields in your Mailbox, basically giving your output data a schema) adds a ton of value, and the larger the dataset, the truer it is.
Usually, you start by defining where you want your data to go, and which structure it should have, before working backwards from here and starting to extract the data. This is the key to automating your document workflow.
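A minimal sketch of the schema-first approach described above: define the output structure, then coerce whatever was extracted into it and report anything that does not fit. The `Invoice` fields and the `coerce` helper are hypothetical and do not represent Parseur's or Parsewise's API.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Invoice:
    invoice_number: str
    issued_on: date
    total: float


def coerce(schema, raw):
    """Build a schema instance from raw extracted strings, collecting errors."""
    converters = {str: str, float: float, date: date.fromisoformat}
    values, errors = {}, []
    for f in fields(schema):
        if f.name not in raw:
            errors.append(f"{f.name}: missing")
            continue
        try:
            values[f.name] = converters[f.type](raw[f.name])
        except ValueError:
            errors.append(
                f"{f.name}: cannot parse {raw[f.name]!r} as {f.type.__name__}"
            )
    # Only return a record when every field satisfied the schema.
    return (schema(**values) if not errors else None), errors


good, good_errors = coerce(
    Invoice,
    {"invoice_number": "INV-7", "issued_on": "2024-03-01", "total": "125.50"},
)
bad, bad_errors = coerce(Invoice, {"invoice_number": "INV-8", "total": "12,50"})

print(good, good_errors)
print(bad, bad_errors)
```

Working backwards from the destination means a malformed extraction is rejected at the boundary instead of surfacing later in a downstream system.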
gergelycsegzi 1 days ago [-]
Hey, good point about structure for integrated workflows:)
Fully agree, for enterprises we need to guarantee types, flag discrepancies and provide underlying sources so they can integrate it downstream (whether that's Databricks, n8n etc.)
1. We are working with the assumption that OCR is (or soon will be) solved at super low prices.
So if we have the extracted data, what can we do with it?
Where we see Parsewise making a difference is for use cases that span across documents.
I.e. if you are extracting the same 5 fields from every invoice, there are lots of solutions as you listed (+ reducto etc). However, once you have a set of documents (e.g. an entire mortgage application package) and you are trying to get a structured response out, then your option is either an LLM API (if things fit into context and you are okay with limited citations), or building a pipeline with LLMs. I posted it in another comment but an example of trawling through 90k pages is here: https://www.parsewise.ai/officeqa-sota
2. As long as we rely on LLMs, the outcomes will be non-deterministic, so the bottleneck is and will remain human verification (at least for somewhat complex use cases). The architecture we have built is optimized to give the human reviewer values and citations that are as granular as possible. This is available either through our platform or through API clients.
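One way to picture "granular values and citations" for a multi-document package: every extracted value carries a pointer to its source, and fields whose sources disagree are flagged for the reviewer. The data structures, file names and quotes below are invented for illustration and are not Parsewise's data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    document: str
    page: int
    quote: str  # the exact text the value was read from


@dataclass(frozen=True)
class Extracted:
    field: str
    value: str
    citation: Citation


def review_queue(items):
    """Group values by field and flag fields whose sources disagree."""
    by_field = defaultdict(list)
    for item in items:
        by_field[item.field].append(item)
    queue = []
    for name, found in by_field.items():
        distinct = sorted({i.value for i in found})
        status = "conflict" if len(distinct) > 1 else "consistent"
        queue.append((name, status, distinct, [i.citation for i in found]))
    return queue


# Hypothetical extractions from a mortgage application package.
items = [
    Extracted("applicant_income", "5200", Citation("payslip.pdf", 1, "Gross monthly pay: 5,200")),
    Extracted("applicant_income", "5600", Citation("application.pdf", 3, "Monthly income: 5,600")),
    Extracted("applicant_name", "J. Doe", Citation("application.pdf", 1, "Applicant: J. Doe")),
]

for name, status, values, citations in review_queue(items):
    print(name, status, values, [(c.document, c.page) for c in citations])
```

The reviewer sees the conflicting income values next to the two passages they came from, so checking takes a glance instead of a search.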
oliver236 11 hours ago [-]
What about deterministic parsing?
Basically using templates to extract info from recurring doc structures ??
rdksu 1 days ago [-]
Hey! Is this kind of like structured output over large-scale document corpora?
gergelycsegzi 1 days ago [-]
Hey, that's exactly it!
red_hare 1 days ago [-]
I say this with a lot of love: The vibecoded applications in your demo reek of AI slop design.
This isn't a critique of your product. It's just that the beige-orange theme, the pill components, and the left-border highlight give me the same visceral reaction as reading a paragraph littered with em dashes and "not X but Y." It makes me take you less seriously.
Cool demo otherwise.
gergelycsegzi 1 days ago [-]
Haha no appreciate it! That's on me for not calling it out explicitly (was trying to make the video as short as possible), but the demo UIs were literally vibe coded to show the ease of integration https://youtu.be/F1cSuZal03s?si=1H4zTcO-8cosLbVr&t=70
mauryaudayan 1 days ago [-]
llamaparse also does this, what is different here?
gergelycsegzi 1 days ago [-]
Similar to my other comment, we assume that llamaparse and others can provide the individual page OCR. But once you have that, integrating it into your workflows often requires additional complexity around combining results from different sources. Here is a deeper dive I wrote on the complexities of building extraction pipelines: https://www.parsewise.ai/doc-processing-pipelines
maxhofer 1 days ago [-]
Mostly cross-doc reasoning at scale (e.g., 90k-page corpora) as opposed to doc-to-markdown conversions.
hnuser 22 hours ago [-]
Just use claude. Not another wrapper
gergelycsegzi 19 hours ago [-]
If Claude is good enough for your use case then for sure. If you need scale, persistent structure and verifiability we can help:)
I do respect your moderation, however I addressed the statement, the choice of words, not the person.
dang 20 hours ago [-]
To sarcastically reword someone's statements using fake quotation marks* to depict them as exploitative is at minimum an accusation of insincerity, and the snark adds an additional layer of aggressiveness. You also used "cognitive dissonance" as a trope to basically accuse them of lying. All this is personal and, since it was an attack, crosses into personal attack.
I learnt a lot at Palantir, though I always worked on the commercial side, so no ties to the security state (for better or worse).
(Also side-note, we are working towards enabling frontier performance with smaller open models that allows our customers to protect their data. https://www.parsewise.ai/officeqa-sota )
And I do get genuine joy from helping our users, so love it is:)
Johnny_Bonk 1 days ago [-]
[flagged]
dang 1 days ago [-]
A launch post is not a place to attack other users personally. Neither is any other HN thread for that matter, so please don't do it here.
Noted, and I did wish the founder success. I have no personal ill will towards them. But what I'd ask HN to consider is this: our world, and the technology we introduce into it, isn't apolitical or free of normative stakes and real, harmful implications for people. Treating where you've worked and what technology you've stewarded into being as an ethically neutral fact isn't neutral at all. What concerns me is that there's an increasing firewall against calling out things that ACTUALLY harm people, while an objection gets reframed as a personal attack on someone willingly able to propagate problematic things. But this seems to be where the corporate tech world is moving as it cozies up to the authoritarians.
dang 20 hours ago [-]
Sure, and HN hosts many threads where people debate these points. We're not against that and often as not agree with them.
But this is a startup launch thread about something unrelated, and hounding someone about an ex-employer is a tenuous ground for bringing such material up. It's the sort of thing this guideline (from https://news.ycombinator.com/newsguidelines.html) asks people not to do, even apart from the personal aspect:
Do you ask this of all HNers who have worked at Meta, Google, Microsoft and Amazon - the latter three of which Palantir relies on to even exist?
I.e. half of HN?
Johnny_Bonk 5 hours ago [-]
no but we're all implicated including myself
gergelycsegzi 1 days ago [-]
Planning to serve good things for sure, and appreciate your note.
Ofc I didn't agree with everything Palantir was doing (also to the extent that we even knew about them at the time). I was working on vaccine distribution and cancer research as well, so definitely felt like helping.
Here is our documentation for working with a fixed JSON schema: https://docs.parsewise.ai/api#schema-driven-extract-convenie...
https://news.ycombinator.com/newsguidelines.html
(* also not allowed here btw: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...)
"Eschew flamebait. Avoid generic tangents."
More about that here in case helpful: https://news.ycombinator.com/item?id=48750103