Yesterday Rand Fishkin, Co-Founder of Moz, and Mike King, iPullRank’s CEO, released a leak of a Google ContentWarehouse API. You can read the original article here.
Having gone through the data a little, my feeling is that it’s a system for document retrieval based on Google search. Some fields sound like they would be useful for enterprise search rather than Google search. Perhaps an enterprise search product was created by forking core Google code. Or maybe Google can be used as an enterprise search solution.
I wanted to analyse the documentation to find anything useful, so I extracted the text and ran it through an LLM. I used, ironically, Google Gemini. I’m publishing this here in case it’s interesting to others and to see what ideas people have.
Initial Findings
When I asked Gemini for anything relevant to SEO in the Google leak, it chose the following. Note that it’s an LLM, so it might make stuff up.
- PageRank: The documentation mentions “pagerank” as a signal used in ranking, but it is marked as deprecated. It is unclear if this is still a factor in Google Search.
- Anchor Text: The documentation mentions “Anchor Text” as a ranking signal used for document scoring. The “anchor” field contains “the preferred way to link to an element.” This highlights the importance of using descriptive anchor text when linking to web pages.
- Website and Domain Signals: The documentation mentions several signals that are related to the website and domain a page is on:
- Host ID: Used to determine if a page is part of a parked domain.
- SiteChunk: A sitechunk is used to group pages together for NSR (Near-Duplicate Site Rank). This indicates that site-level signals are a factor in ranking.
- NSR (Near-Duplicate Site Rank): This is a site-level signal, and the documentation mentions how it is used in “Quality* (Q*)” ranking.
- Site Spam Score: The documentation refers to a “Site Spam Score” as a signal used in SafeSearch. This suggests that Google considers spam signals at the site level, not just the page level.
- Site Quality Features: The documentation mentions a “Site Quality Features” proto, which suggests that Google considers signals about the overall quality of a website.
- Content Quality: The documentation mentions several signals related to content quality:
- Original Content Score: This score is used to identify pages with little content.
- SpamWordScore: This score measures the likelihood that a page uses spammy keywords.
- GibberishScore: This score identifies pages that contain gibberish content.
- UAC Spam Score: This score identifies pages that violate Google’s user-generated content spam policies.
- SpamBrain LAVC Score: This score is used to identify pages that are classified as spam by Google’s SpamBrain system.
- Freshness:
- Last Significant Update: This signal tracks the last time a page was significantly updated.
- Change Period: This signal estimates the average time between changes on a page.
- SemanticDate: This signal estimates the date of a page’s content based on its content, anchors, and related documents.
- Technical SEO:
- Mobile Friendliness: The documentation mentions a “SmartphonePerDocData” proto which contains signals related to a page’s mobile-friendliness, indicating this is a critical ranking factor.
- HTTPS: The documentation mentions “HTTPS” as a ranking factor.
- Other Considerations:
- Content Language: The documentation emphasizes the importance of language, with several fields dedicated to language detection and indexing. This suggests the importance of serving content in the correct language for the targeted audience.
- Title Tag: The documentation mentions “title” as a field that is extracted from the title tag of the content. This underscores the importance of using relevant and descriptive title tags for each page.
- Meta Description: The documentation mentions “meta description” and “body title,” suggesting that Google uses these elements for ranking and snippet display.
- Robots.txt: The documentation mentions “robots.txt” as a way for website owners to control how Google crawls their site.
- Structured Data: The documentation mentions several protos that relate to structured data, including schema.org markup, suggesting that using structured data on websites is important for SEO.
Some of it isn’t very interesting and has already been covered, but some things are really interesting, especially Domain Signals, Content Quality and Freshness. The points from the original article’s analysis don’t show up:
- Navboost and click data
- Use of Chrome data
- Whitelists
- Quality rater data
An actual human reading through the documentation can understand things in a way the LLM probably won’t. There’s too much for anyone to read in detail though.
For each of the points the LLM highlighted from the Google leak, more data can easily be found. Prompting the LLM with more specific questions can get more details. Just searching through the docs for these terms throws up more information too.
Searching the docs
As an example, pagerank is interesting, and it’s mentioned ~30 times in the docs. Gemini thinks it’s deprecated because that’s mentioned in one place, but this doesn’t mean it’s generally deprecated. Just searching through the docs for this string throws up some interesting things:
According to anchor quality bucket, anchor with pagerank > 51000 is the best anchor. anchors with pagerank < 47000 are all same.
More accuracy than 1-10!
toolbarPagerank (type: integer(), default: nil) – A copy of the value stored in /namespace/indexing/wwwglobal//fakepr/* for this document. A value of quality_bakery::FakeprUtils::kUnknownToolbarPagerank indicates that we don’t have toolbar pagerank for this document. A value between 0 and 10 (inclusive) means that this is the toolbar pagerank of the page. Finally, if this value is not set it means that the toolbar pagerank is equivalent to: quality_bakery::FakeprUtils::EstimatePreDemotionFromPagerankNearestSeeds
It used to be shown in the Google Toolbar. Calling it fakepr suggests what was shown in the toolbar wasn’t directly pagerank.
homepagePagerankNs (type: integer(), default: nil) – The page-rank of the homepage of the site. Copied from the cdoc.doc().pagerank_ns() of the homepage.
Another point for domain authority.
crawlPagerank (type: integer(), default: nil) – This field is used internally by the docjoiner to forward the crawl pageranks from original canonicals to canonicals we actually chose; outside sources should not set it, and it should not be present in actual docjoins or the index.
Is this talking about how pagerank is passed between pages?
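The kind of string search described above can be sketched in a few lines of Python. This is only an illustration: `find_mentions` and `field_names_containing` are hypothetical helper names, and the sample string is stitched together from the field definitions quoted above rather than read from the full extracted docs.

```python
import re
from collections import Counter


def find_mentions(text, term, width=60):
    """Return short context snippets for every case-insensitive match of term."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        # Collapse whitespace so each snippet prints on one line.
        snippets.append(" ".join(text[start:end].split()))
    return snippets


def field_names_containing(text, term):
    """Count whole identifiers (e.g. toolbarPagerank) that contain term."""
    pattern = re.compile(r"\b\w*" + re.escape(term) + r"\w*\b", re.IGNORECASE)
    return Counter(match.group(0) for match in pattern.finditer(text))


# Sample text assembled from the definitions quoted in this post.
sample = (
    "toolbarPagerank (type: integer(), default: nil) "
    "homepagePagerankNs (type: integer(), default: nil) "
    "crawlPagerank (type: integer(), default: nil)"
)

mentions = find_mentions(sample, "pagerank")
fields = field_names_containing(sample, "pagerank")
for name, count in fields.items():
    print(name, count)
```

To run it over the real docs, read the extracted text file into a string and pass that in place of `sample`. Listing the distinct field names first is a quick way to decide which definitions are worth reading in full.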
More specific prompting
I thought linking and content quality would be especially interesting topics, so I asked the AI about those. Here’s what it found from the Google leak:
Linking
I asked Gemini to find information relevant to linking strategies. Here’s the summary:
- Anchor text:
- `anchorText` (in `Anchors`, `AnchorStats`, `TargetLink`, `TargetGroup`, `TrawlerFetchReplyData`, `ScienceCitation`, `RepositoryWebrefSimplifiedAnchor`, `RepositoryWebrefOutlinkInfo`, `RepositoryWebrefMention`, `VideoAnchorSets`): This field captures the text used for linking to a specific document from other pages, often referred to as anchor text. It plays a crucial role in SEO by providing context and relevance signals to search engines about the target page.
- `anchorIndices` (in `RepositoryWebrefSimplifiedAnchor`): This field identifies a set of equivalent anchors, indicating that multiple anchors point to the same target page, which can be leveraged for SEO purposes to enhance link authority.
- Link Attributes:
- isNofollow (in RepositoryWebrefOutlinkInfo): This attribute indicates if a link is a nofollow link, which tells search engines not to transfer link juice to the target page. Understanding this attribute can help optimize link strategies.
- linkWeight (in RepositoryWebrefEntityLinkMetadata, RepositoryWebrefWebrefEntityRelationship): This field reflects the importance or strength of a link, potentially indicating higher relevance for SEO.
- topicalityWeight (in RepositoryWebrefOutlinkInfo): This field reflects the topicality of a link, which is important for SEO as it helps determine if the link aligns with the target page’s topic.
- Entity Relationships:
- annotatedRelationship (in RepositoryWebrefWebrefEntity): This field stores information about relationships between annotated entities, which can be used to understand the context and connections between linked pages for SEO purposes.
- URL Information:
- url (in numerous fields): The url field represents the URL of the linked document, which is essential for SEO as it serves as the identifier for the target page.
- forwardingUrls (in RepositoryWebrefDocumentInfo): These URLs indicate pages that redirect to the target page. Search engines need to understand these redirects to ensure accurate indexing.
- isDisambiguationPage (in RepositoryWebrefDocumentInfo): This field indicates if the linked page is a disambiguation page. Understanding this can help avoid unwanted links to disambiguation pages, which can dilute link authority.
- displayUrl (in RepositoryWebrefWebrefOutlinkInfo, QualitySitemapSubresultList, IndexingConverterSourceUrlDocjoinInfo, StorageGraphBfgLivegraphProvenanceMetadata, SocialCommonAttachmentAttachment, GenericSnippetResponse, RepositoryWebrefGlobalLinkInfo): This field represents the URL that is displayed to users. It can be used to optimize link presentation for SEO, such as using shortened URLs.
- linkTarget (in SocialCommonAttachmentAttachment): This is the URL to navigate to when clicked, important for SEO as it ensures accurate linking behavior.
- Document Metadata:
- docid (in numerous fields): This field represents a unique identifier for a document, crucial for SEO as it enables search engines to track and identify specific pages.
- title (in numerous fields): The title of the linked document, a key factor in SEO as it provides a concise description of the page’s content.
- description (in numerous fields): The description of the linked document, used for SEO as it offers additional context and information about the page.
- salientTerms (in RepositoryWebrefDocumentInfo, IndexingConverterSourceUrlDocjoinInfo): These terms represent the most important words or phrases on the page, which can inform link building strategies.
- category (in RepositoryWebrefWebrefEntities, RepositoryWebrefCategoryInfo): Categories associated with the entity, useful for understanding the topic of the linked page for SEO.
- topEntity (in IndexingConverterSourceUrlDocjoinInfo, RepositoryWebrefWebrefEntities): Identifies the most important entity on the linked page, which can be leveraged for SEO purposes to enhance relevance.
- language (in numerous fields): The language of the linked document, important for SEO as it enables search engines to understand the language of the content and target the appropriate audience.
- Additional Information:
- referencePageScores (in RepositoryWebrefEntityAnnotations): This field provides signals used for identifying reference pages, which are authoritative pages for a specific entity. Understanding these signals can inform link building to reference pages for SEO.
Again, others have covered some of this and some is a bit dull. topicalityWeight is interesting in terms of links, as are Entity Relationships, salientTerms, topEntity and especially referencePageScores. I probably need to think of a better prompt.
Content Quality
This is interesting considering the recent “Helpful Content” Update. I asked it what actions I could take to improve content quality. This means the output is quite different:
Content Quality & Relevance
- Improve Content Accuracy: Ensure all information within your content is accurate and up-to-date. This includes things like names, addresses, phone numbers, and business details.
- Use Semantic Legos: Leverage the “Semantic Legos” feature to annotate your content with semantic relationships between entities. This helps the search engine better understand the meaning and context of your content.
- Optimize for Salient Terms: Pay attention to “Salient Terms” as they indicate important concepts and phrases within your content. Ensure your content uses these terms strategically.
- Improve Entity Recognition: Help the search engine correctly identify and categorize entities within your content, such as people, places, and organizations. Use clear and accurate naming conventions, and provide structured data where possible.
- Target Relevant Categories: Ensure your content is categorized correctly within relevant HitCat categories. This helps the search engine understand your content’s focus and relevance.
- Avoid Spam: Use “SpamWordScore” and other spam-related signals to identify and address spammy content.
- Manage Content Freshness: Maintain the freshness of your content by updating it regularly. Use “lastSignificantUpdate” and “contentage” fields to understand the age of your content.
- Address Unreliable Dates: Correct any unreliable dates in your content. Use “DateExposureScore” and other related fields to identify problematic dates.
- Optimize for Featured Snippets: Strive to create content that is eligible for “Featured Snippets” by providing concise and informative answers to common queries.
Technical SEO
- Optimize for Mobile Devices: Ensure your website is mobile-friendly. Consider using “isSmartphoneOptimized” and other related fields to assess your website’s mobile compatibility.
- Improve Page Speed: Pay attention to “CoreWebVitals” and other page speed metrics. Ensure your website loads quickly for users, improving user experience and SEO ranking.
- Improve HTML Rendering: Ensure your HTML renders correctly and produces high-quality “SnapshotDocument” output.
- Manage Image Quality: Optimize your images for web use. Use “ImageRepositoryShoppingProductInformation” to provide product-specific information about your images, enhancing their discoverability.
- Leverage Schema.org: Use Schema.org vocabulary to markup your content and provide structured data about your website and its entities. This helps the search engine understand and display your content more effectively.
- Use Robots.txt: Use “RobotsInfo” and “RobotsTxt” to manage crawling and indexing behavior for your website.
- Implement HSTS: Utilize HSTS (HTTP Strict Transport Security) to ensure your website is accessed securely via HTTPS.
- Manage Redirects: Use “Redirects” to effectively manage URL redirects on your website, preventing broken links and ensuring users reach the intended pages.
- Optimize for Video Search: Utilize the “VideoContentSearch” features to create and optimize video content for SEO.
Advanced SEO
- Use Live Results: Consider using “LiveResults” for time-sensitive content. Ensure your content is timely and relevant to current events.
- Implement Rich Snippets: Leverage “RichsnippetsPageMap” to enhance your website’s visibility in search results with rich snippets.
- Manage Social Signals: Pay attention to social signals, such as “socialgraphNodeNameFp”, which can influence your website’s SEO performance.
- Leverage Google Assistant Capabilities: Understand the capabilities of Google Assistant and optimize your content for voice search queries.
Another mixed bag. “Semantic Legos” sounds interesting, but looks to just be something to do with YouTube. SpamWordScore is also interesting. Is there a blacklist?
Freshness is mentioned again. I’ve no idea what it’s on about with regard to ‘DateExposureScore’.
I looked up LiveResults, which led to this:
Page boosting using Live Results data. Attaching this proto to a document indicates that a Live Result feed exists for that document, and that this feed has sufficient reliability to warrant potential boosting of the document rank. Next ID: 3
- boostLevel (type: number(), default: nil) – Identifies the degree to which the existence of this LiveResult should boost a query’s score (when the query is performed within the hot_times range). This field is always in the range [0,1]. A missing field, a value of 0, or a value outside the legal range indicates that no boosting is performed. A value of 1 indicates that the maximum level of boosting will be applied. This field will be updated from time to time based on CTR and other signals.
- hotTimes (type: GoogleApi.ContentWarehouse.V1.Model.WeboftrustTimeRange.t, default: nil) – Specifies the time range within which this LiveResult is relevant. Used for deciding whether the rank of the corresponding page should be boosted. For example, this range can be set to encompass a few days before and after a sports game to which the page refers. If this field is missing, no boosting is performed. Specifying that a page should always be boosted is not recommended, but can be accomplished by setting hot_times.start_unix_time=0 and hot_time.end_unix_time=0x7fffffff. Note that multiple time ranges can be implemented by adding several LiveResultDocBoostData messages to the LiveResultsDocAttachments proto. If overlapping time ranges are used, the proto containing the highest boost level will be used.
It sounds very manual. Maybe it’s for things like the Super Bowl and the World Cup.
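To make the rules in that description concrete, here is a minimal Python sketch of how the boost for a document might be chosen. It is my reading of the quoted text, not Google’s code: the class and function names are hypothetical, and treating the time range as inclusive at both ends is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TimeRange:
    start_unix_time: int
    end_unix_time: int


@dataclass
class LiveResultDocBoostData:
    boost_level: Optional[float] = None
    hot_times: Optional[TimeRange] = None


def effective_boost(entries: Sequence[LiveResultDocBoostData], query_time: int) -> float:
    """Pick the boost for a query at query_time, following the quoted description."""
    best = 0.0
    for entry in entries:
        # Missing boostLevel or hotTimes: no boosting is performed.
        if entry.boost_level is None or entry.hot_times is None:
            continue
        # A value outside [0, 1] also means no boosting.
        if not 0.0 <= entry.boost_level <= 1.0:
            continue
        # Assumption: both ends of the range are inclusive.
        in_range = entry.hot_times.start_unix_time <= query_time <= entry.hot_times.end_unix_time
        if in_range:
            # Overlapping ranges: the highest boost level wins.
            best = max(best, entry.boost_level)
    return best


entries = [
    LiveResultDocBoostData(0.3, TimeRange(100, 200)),
    LiveResultDocBoostData(0.8, TimeRange(150, 250)),
    LiveResultDocBoostData(1.5, TimeRange(0, 1000)),  # illegal value, ignored
    LiveResultDocBoostData(0.5, None),                # no time range, ignored
]
print(effective_boost(entries, 160))
```

Written out like this, the point stands: someone or something has to decide the time window and the boost level for each page, even if CTR later adjusts the level.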
If the output isn’t so exciting, looking up the docs for the things it mentions does seem to get to more interesting things. E.g. socialgraphNodeNameFp:
socialgraphNodeNameFp (type: String.t, default: nil) – For Social Search we store the fingerprint of the SG node name. This is used in one of the superroot’s PRE_DOC twiddlers as a lookup key for the full Social Search data. PRE_DOC = twiddlers firing before the DocInfo request is sent to the mustang backend
I imagine people will be talking about twiddlers quite a bit. The next thing in the docs is DocLevelSpamScore, and this area of the docs seems to have a bunch of interesting things.
IsAnchorBayesSpam (type: boolean(), default: nil) – Is this document considered spam by the anchor bayes classifier?
semanticDate (type: integer(), default: nil) – SemanticDate, estimated date of the content of a document based on the contents of the document (via parsing), anchors and related documents
datesInfo (type: String.t, default: nil) – Stores dates-related info (e.g. page is old based on its date annotations). Used in FreshnessTwiddler
A freshness twiddler…
Also in here: lastSignificantUpdate, spamrank (the likelihood that this document links to known spammers) and trendspamScore (For now, the count of matching trendspam queries). There are lots of interesting things in here. nsrIsCovidLocalAuthority suggests this is relatively recent data related to Google search (you wouldn’t have this in an enterprise search solution). There’s so much there (I don’t know how it all ended up on one line) that it’s impossible to read through and summarise it all.
Spam
I read so many mentions of spam that I had to ask it how to avoid being seen as spam. Gemini highlights these attributes:
Here are the actions you can take to avoid having your document categorized as spam by Google, along with explanations of what each attribute means:
- Action: Avoid using images that have a high finalPornScore.
- finalPornScore is a Google-internal metric that indicates the likelihood that an image contains adult content. A higher score suggests a greater likelihood of pornographic content.
- Action: Avoid using images that are flagged as isUnwantedContent.
- isUnwantedContent is a Google-internal attribute indicating that an image has been identified as undesirable for Google Search. This could be due to various factors like low quality, irrelevant content, or potential policy violations.
- Action: Ensure that images are isIndexedByImagesearch.
- isIndexedByImagesearch is a Google-internal attribute indicating that an image has been successfully processed by Google’s image indexing system.
- Action: Avoid using websites that are flagged as isSitePorn.
- isSitePorn is a Google-internal metric that indicates the likelihood that an entire website contains adult content. A higher score suggests a greater likelihood of pornographic content across the site.
- Action: Avoid using documents where spamtokensContentScore is above a specific threshold.
- spamtokensContentScore is an internal Google metric that evaluates the presence of spammy keywords and patterns within a document’s content. A higher score suggests a greater likelihood of spam.
- Action: Avoid using documents where uacSpamScore is above a specific threshold.
- uacSpamScore is an internal Google metric that evaluates the presence of spammy content and techniques within a document, focusing on user-generated content.
- Action: Avoid using documents where DocLevelSpamScore is above a specific threshold.
- DocLevelSpamScore is an internal Google metric that represents the overall likelihood of a document being spam.
- Action: Avoid using documents where spamrank is above a specific threshold.
- spamrank is an internal Google metric that assesses the likelihood that a document is linked to or associated with known spam websites.
- Action: Avoid using documents with a high KeywordStuffingScore.
- KeywordStuffingScore is an internal Google metric that evaluates the density of keywords within a document’s content. A high score suggests the use of keywords in an unnatural or manipulative way.
- Action: Avoid using documents where OriginalContentScore is below a specific threshold.
- OriginalContentScore is an internal Google metric that assesses the originality and value of a document’s content.
- Action: Ensure that crawlerIdProto.language is set to a valid language code.
- crawlerIdProto.language represents the language code assigned to a document by Google’s crawlers.
- Action: Ensure that crawlerIdProto.country is set to a valid country code.
- crawlerIdProto.country represents the country code assigned to a document by Google’s crawlers.
- Action: Avoid using documents that are flagged as isRoboted.
- isRoboted is an internal Google attribute indicating that a document has been identified as disallowed for crawling by Google.
- Action: Avoid using documents where QuarantineInfo indicates quarantine.
- QuarantineInfo is a Google-internal attribute indicating that a document has been identified as potentially harmful or spammy and has been quarantined.
- Action: Avoid using images that have a high finalViolenceScore.
- finalViolenceScore is a Google-internal metric that indicates the likelihood that an image contains violent or graphic content.
- Action: Avoid using images that have a high brainPornScores.porn_score.
- brainPornScores.porn_score is a Google-internal metric that indicates the likelihood that an image contains adult content, using Google’s image recognition models.
- Action: Ensure that finalPornScoreVersion, brainPornScoresVersion, and finalViolenceScoreVersion are set to valid version strings.
- These are internal Google attributes that indicate the version of Google’s algorithms used for content analysis (porn, violence, etc.).
- Action: Avoid using images that are flagged as isIipInScope.
- isIipInScope is a Google-internal attribute indicating that an image is associated with a set of restricted entities that are considered inappropriate.
- Action: Ensure that contentRestriction is not set to a value that indicates a restriction.
- contentRestriction is an internal Google attribute that indicates whether a document is subject to content-based restrictions.
- Action: Ensure that moderationReason is not set to a value that indicates a restriction.
- moderationReason is an internal Google attribute that provides the reason for why a document has been restricted.
- Action: Ensure that moderationInfo.moderationReason is not set to a value that indicates a restriction.
- moderationInfo.moderationReason is an internal Google attribute that provides the reason for why a comment has been moderated.
- Action: Ensure that moderationInfo.moderatorType is not set to a value that indicates a restriction.
- moderationInfo.moderatorType is an internal Google attribute that indicates the type of system that moderated a comment.
- Action: Ensure that deleteReason is not set to a value that indicates a restriction.
- deleteReason is an internal Google attribute that provides the reason for why a comment has been deleted.
- Action: Ensure that PolicyData.policy is not set to a value that indicates a restriction.
- PolicyData.policy is an internal Google attribute that indicates any restrictions applied to a comment, e.g. for legal reasons or based on user settings.
- Action: Avoid using documents where SpamWordScore is above a specific threshold.
- SpamWordScore is an internal Google metric that evaluates the presence of spammy keywords within a document.
- Action: Avoid using documents where ScaledSpamScoreYoram is above a specific threshold.
- ScaledSpamScoreYoram is an internal Google metric that evaluates the overall spamminess of a document.
I gave it another shot with a different prompt:
- Ensure your document’s crawlStatus is not ROBOTED. The crawlStatus attribute indicates the crawling status of the document. A ROBOTED status means the document is disallowed for crawling according to the host’s robots.txt. This is a major indicator of spam and could lead to your document being penalized.
- Avoid using spamtokensContentScore. This attribute is specifically for identifying UGC spam, and using it could trigger a spam classification.
- Avoid setting isUnwantedContent to true in ImageData. This attribute indicates the image is unwanted for the search index and will likely lead to your document being penalized.
- Avoid generating a high finalPornScore. A high finalPornScore in ImageData suggests your document is pornographic, which will lead to penalties.
- Avoid having a high DocLevelSpamScore. This attribute indicates the document is likely spam, so make sure your document doesn’t have a high score in this field.
- Avoid triggering BadBackLinks. If your document triggers BadBackLinks, your improvanchor score will not be used by Google search, which can hurt your ranking.
- Avoid having a high number of spammy anchor phrases. phraseAnchorSpamCount, phraseAnchorSpamDays, and phraseAnchorSpamFraq are all attributes that indicate spammy anchor phrases.
- Avoid having a high SpamWordScore. This attribute suggests your document is likely spam due to keyword stuffing.
- Avoid having a high uacSpamScore. This attribute is used to identify UGC spam, so keep it low.
- Avoid having a high GibberishScore. A high GibberishScore suggests that your document contains nonsensical content, which could be a sign of spam.
Some of it is interesting, but the LLM struggles to understand what the attributes are and how to interpret them.
Methodology
It’s a lot of data (~12MB of text / code) and it’s sat on a GitHub repository. Here’s what I did to get something to input into the AI.
1. Download the files
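I only wanted the files for the content_warehouse client, at one specific commit:
git clone --filter=blob:none --no-checkout https://github.com/yoshi-code-bot/elixir-google-api.git # Clone without file contents or a working tree
cd elixir-google-api
git sparse-checkout init --cone # Enable sparse checkout in cone mode
git sparse-checkout set clients/content_warehouse # Only materialise this directory
git checkout d7a637f4391b2174a2cf43ee11e6577a204a161e # Check out the commit with the leaked docs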
2. Compile the documentation
I figured the interesting bits would be the documentation included inline in the code. It’s written in Elixir and theoretically you can just generate the docs. This turned out to be a bit annoying.
3. Extract the text
The documentation is HTML and includes navigation and such. I just wanted the text, so I ran it through unfluff. This tries to find the main text from web pages. It might throw relevant things away, but it probably did a decent job.
Install Node.js and unfluff:
sudo apt install nodejs # Install Node.js
sudo apt install npm # Install npm
npm install -g unfluff # Install unfluff globally
Extract all the text:
find . -type f -name "*.html" -exec sh -c 'unfluff "$1"; echo; echo' _ {} \; > output.txt
You can find the extracted text here.
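If you would rather not install Node, a rough alternative is Python’s standard-library HTML parser. To be clear, this is not what I ran and it is not equivalent to unfluff: it keeps all visible text and only skips a few tags, and the choice of tags in `SKIP_TAGS` is an assumption about what counts as navigation noise.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of a few noisy tags."""

    SKIP_TAGS = {"script", "style", "nav", "head"}  # assumed noise

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            # Normalise internal whitespace within each text chunk.
            self.chunks.append(" ".join(data.split()))


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_tree(root):
    """Concatenate the text of every .html file under root."""
    pages = sorted(Path(root).rglob("*.html"))
    return "\n\n".join(html_to_text(p.read_text(encoding="utf-8")) for p in pages)


page = (
    "<html><head><title>T</title><style>a{}</style></head><body>"
    "<nav>Menu</nav><h1>semanticDate</h1>"
    "<p>estimated  date &amp; more</p><script>var x=1;</script></body></html>"
)
print(html_to_text(page))
```

Because it keeps everything outside the skipped tags, this is less likely to drop field definitions than a main-content heuristic, at the cost of letting more boilerplate through.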
4. Pass it through Gemini
First off I used Anthropic’s prompt generator to write a nice prompt, and then just pasted in the text. For example, enter “Here is documentation about Google search. Find everything relevant to SEO” in the prompt generator.
Here is some documentation text:
<documentation>
{{DOCUMENTATION}}
</documentation>
Your task is to find all parts of this documentation that are relevant to the following query:
<query>{{QUERY}}</query>
To complete this task:
1. Carefully read through the full documentation text.
2. Identify all sections that contain information relevant to answering the query. The query is looking for a specific subset or aspect of the overall documentation.
3. Extract the text from each relevant section and include it inside <relevant_section> tags. Aim to only include the most essential sentences needed to answer the query. Omit any irrelevant or tangential information.
<relevant_sections>
<relevant_section>
[Extracted relevant text 1]
</relevant_section>
<relevant_section>
[Extracted relevant text 2]
</relevant_section>
...
</relevant_sections>
4. After extracting the relevant sections, write a concise summary of the key information in those sections that answers the query. Include this summary inside <answer> tags.
<answer>
[Summary of relevant information]
</answer>
If there are no sections of the documentation relevant to the query, simply write:
<relevant_sections>None</relevant_sections>
<answer>The provided documentation does not contain any information relevant to the query.</answer>
Check the output makes sense (what it came up with above seems a bit much…), and then paste it into Gemini (don’t forget to replace {{DOCUMENTATION}} with the extracted text).
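If you want to script this instead of pasting by hand, the substitution and the parsing of the reply can be sketched as below. The helper names `fill_prompt` and `parse_response` are hypothetical, `TEMPLATE` is an abbreviated stand-in for the full prompt above, and the sample reply is made up to show the expected shape. Sending the prompt to the model is deliberately left out.

```python
import re

# Abbreviated stand-in for the full prompt shown above.
TEMPLATE = (
    "Here is some documentation text:\n"
    "<documentation>\n{{DOCUMENTATION}}\n</documentation>\n"
    "Your task is to find all parts of this documentation that are relevant "
    "to the following query:\n<query>{{QUERY}}</query>\n"
)


def fill_prompt(template, **values):
    """Replace each {{NAME}} placeholder with values[NAME] in a single pass."""
    def substitute(match):
        return values[match.group(1)]  # KeyError if a placeholder is unfilled
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


def parse_response(response):
    """Pull the <relevant_section> blocks and the <answer> out of a reply."""
    sections = [
        s.strip()
        for s in re.findall(r"<relevant_section>(.*?)</relevant_section>", response, re.DOTALL)
    ]
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return sections, (answer.group(1).strip() if answer else None)


prompt = fill_prompt(TEMPLATE, DOCUMENTATION="semanticDate (type: integer())", QUERY="freshness")

# Made-up reply in the format the prompt asks for.
reply = (
    "<relevant_sections>\n<relevant_section>\nsemanticDate (type: integer())\n"
    "</relevant_section>\n</relevant_sections>\n"
    "<answer>\nsemanticDate estimates the content date.\n</answer>"
)
sections, answer = parse_response(reply)
print(sections, answer)
```

Using a replacement function rather than chained string replaces means any text in the docs that happens to look like a placeholder is left alone, since the inserted text is never rescanned.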
It’s still a lot of text – 350k tokens. I used Gemini Flash, and I think this keeps it in the free tier. If not, it should only cost around 25c per request (assuming I understood the documentation and did the maths correctly).
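The maths behind that estimate is just tokens times price. A small sketch, where the per-million-token rate is an assumption I picked to be consistent with the 25c figure, not a quoted price, so check the current pricing page before relying on it:

```python
def request_cost_usd(prompt_tokens, usd_per_million_tokens):
    """Input-side cost of one request; output tokens are ignored."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens


# Assumed rate in USD per million input tokens (hypothetical, see above).
ASSUMED_RATE = 0.70
PROMPT_TOKENS = 350_000  # size of the extracted docs, as reported above

cost = request_cost_usd(PROMPT_TOKENS, ASSUMED_RATE)
print(f"Estimated cost per request: ${cost:.3f}")
```

Output tokens are left out because the replies are tiny compared with a 350k-token prompt, so they barely move the total.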
Further work
- Fix the spelling mistakes and errors I’ve presumably made above.
- The most obvious thing to do is ask more questions and come up with better prompts.
- Use a better LLM (though it’ll cost more).
- It might also be worth trying to extract the text in a better way, e.g. use formatting and check that no relevant information is being thrown away.