📄️ Beautiful Soup
Beautiful Soup is a Python package for parsing HTML and XML documents.
📄️ Google Cloud Document AI
Document AI is a document understanding platform from Google Cloud to transform unstructured data from documents into structured data, making it easier to understand, analyze, and consume.
📄️ Doctran: extract properties
We can extract useful features of documents using the Doctran library, which uses OpenAI's function calling feature to extract specific metadata.
📄️ Doctran: interrogate documents
Documents used in a vector store knowledge base are typically stored in a narrative or conversational format. However, most user queries are in question format. If we convert documents into Q&A format before vectorizing them, we can increase the likelihood of retrieving relevant documents, and decrease the likelihood of retrieving irrelevant documents.
📄️ Doctran: language translation
Comparing documents through embeddings has the benefit of working across multiple languages. "Harrison says hello" and "Harrison dice hola" will occupy similar positions in the vector space because they have the same meaning.
📄️ Google Translate
Google Translate is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another.
📄️ HTML to text
html2text is a Python package that converts a page of HTML into clean, easy-to-read plain ASCII text.
📄️ Nuclia
Nuclia automatically indexes your unstructured data from any internal and external source, providing optimized search results and generative answers. It can handle video and audio transcription, image content extraction, and document parsing.
📄️ OpenAI metadata tagger
It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for a more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.
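
As a rough illustration of the tagging idea in the last entry, the sketch below attaches structured metadata to documents and then filters on it. It is a rule-based stand-in, not the OpenAI metadata tagger: the `Document` class, `tag_document`, `filter_by_metadata`, and the 50-word threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def tag_document(doc: Document) -> Document:
    """Return a copy of doc with simple rule-based metadata added."""
    # Assumption: the first non-empty line serves as the title.
    lines = [line.strip() for line in doc.page_content.splitlines() if line.strip()]
    title = lines[0] if lines else ""
    word_count = len(doc.page_content.split())
    # Assumption: 50 words is an arbitrary cut-off between "short" and "long".
    length = "short" if word_count < 50 else "long"
    return Document(
        page_content=doc.page_content,
        metadata={**doc.metadata, "title": title, "word_count": word_count, "length": length},
    )


def filter_by_metadata(docs, **criteria):
    """Keep only documents whose metadata matches every criterion."""
    return [d for d in docs if all(d.metadata.get(k) == v for k, v in criteria.items())]


docs = [
    Document("Release notes\nFixed two bugs in the parser."),
    Document("Design overview\n" + "word " * 60),
]
tagged = [tag_document(d) for d in docs]
short_docs = filter_by_metadata(tagged, length="short")
```

In a real pipeline the tags would presumably come from a model rather than fixed rules, but the downstream use is the same: narrow the candidate set by metadata before or alongside the similarity search.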