Getting Started
Subworkflow.ai helps you build Document RAG and VLM applications.
Before you begin...
Whilst Subworkflow can be a powerful utility for AI projects, it's still a specific tool made for a specific job.
✅ You need to handle very large document workloads - the bigger the better!
Projects with large or medium document workloads will get the most value out of Subworkflow AI.
- Large documents are defined as having either 1000+ pages or 300MB+ file size.
- Medium documents are defined as having either 300 - 1000 pages or 100MB - 300MB file size.
- Small documents are defined as having either 100 - 300 pages or 50MB - 100MB file size.
If your documents are any smaller, Subworkflow might be overkill for what you want to do. In this instance, we find it easier to just upload the whole document to the LLM and rely on the context window to handle your RAG agent.
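As a rough sketch of the size guidance above, the tiers can be expressed as a small helper. The function name and the "below-small" label are hypothetical, and the boundary handling is an assumption: the ranges above overlap at their edges (300 pages, 100MB), so this version treats each lower bound as inclusive and picks the largest matching tier.

```typescript
// Size tiers from this guide. Boundary handling is an assumption:
// each lower bound is treated as inclusive, largest tier wins.
type SizeTier = "large" | "medium" | "small" | "below-small";

function classifyDocument(pages: number, fileSizeMb: number): SizeTier {
  // A document qualifies for a tier on EITHER page count or file size.
  if (pages >= 1000 || fileSizeMb >= 300) return "large";
  if (pages >= 300 || fileSizeMb >= 100) return "medium";
  if (pages >= 100 || fileSizeMb >= 50) return "small";
  // Anything smaller may fit directly in an LLM context window.
  return "below-small";
}
```

A result of "below-small" is the case where uploading the whole document to the LLM is likely the simpler option.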
✅ You're only interested in the Retrieval functionality for your AI/RAG Project
Subworkflow.ai assumes customers are able to build on top of our technology. We provide the APIs and pipeline but you are required to provide your own frontend and LLM - whether this is a chat interface, workflow or agent. If you require a frontend or integration, please reach out and we'll connect you to our AI experts network.
✅ You need to Speed Up Project Development and reduce Infrastructure Costs
The Subworkflow service is an architecture of servers, queues, workers, storage, databases and APIs which, though simple to understand, requires substantial time and effort to put together. We realised this when building our own RAG projects and so now offer the service as a way for teams to offset this cost for their AI initiatives.
During our beta phase, we'd love to chat about your use case and get feedback on our offering. Visit our contact page to get in touch.
Sign up and Register your Organisation
You'll need to sign up for an account with your organisation (if not invited to one) to gain access to Subworkflow. An organisation represents the company you work for and is where you'll manage your workspaces, datasets and team as well as access controls and billing.
Subscribe to 14 Days Free Trial
Our 14-day free trial is the best way to evaluate the Subworkflow AI service. Sign up to the Standard Plan to start working with Medium to Large Documents or the Starter Plan if your primary challenge is frequency and not size. Unfortunately, we do not currently have a free tier but please do reach out to us at support@subworkflow.ai if you wish to discuss using Subworkflow in Education.
Note: We use Stripe™ to handle our subscriptions. We require a card to sign up for the trial; this is to prevent abuse and help focus our support priorities.
Generate a Workspace API Key
Workspaces help organise and scope access to documents (or rather Datasets) for your team, clients or projects. API Keys are also workspace-scoped, meaning they are only valid for the workspace they are generated for. In the Workspace > Settings > Keys page, click on the "Create a new API key" button to create your key. Once the new API key is created, keep it secret and copy it only when you're ready to use it!
- All uploaded files go to the workspace the key belongs to.
- All queries and searches will be scoped to the same workspace the key belongs to.
- Keep a note of which workspace a key belongs to!
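One way to keep track of which key belongs to which workspace is to store each key under an environment variable named after the workspace. This is only a sketch: the `SUBWORKFLOW_KEY_` prefix, the workspace labels and both function names are hypothetical conventions, not something Subworkflow requires.

```typescript
// Hypothetical convention: one environment variable per workspace,
// e.g. SUBWORKFLOW_KEY_CLIENT_A for a workspace you label "client-a".
type Env = Record<string, string | undefined>;

function envVarForWorkspace(label: string): string {
  // Normalise the label into an upper-case, underscore-separated suffix.
  const suffix = label.trim().toUpperCase().replace(/[^A-Z0-9]+/g, "_");
  return `SUBWORKFLOW_KEY_${suffix}`;
}

function apiKeyForWorkspace(env: Env, label: string): string {
  const name = envVarForWorkspace(label);
  const key = env[name];
  if (!key) {
    // Fail loudly rather than silently falling back to another workspace's key.
    throw new Error(`No API key found in ${name} for workspace "${label}"`);
  }
  return key;
}
```

In a Node.js application you would typically call `apiKeyForWorkspace(process.env, "client-a")` and pass the result to the SDK or the `x-api-key` header.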
You're now ready to upload your first Document!
The intended way to use Subworkflow is through the REST API, and this is how you'll start uploading your documents. We recommend using our SDKs if you can as they help to simplify Subworkflow usage in your application.
- Curl
- JS/TS
curl -X POST https://api.subworkflow.ai/v1/extract \
--header 'x-api-key: <YOUR-API-KEY>' \
--form "file=@/path/to/file"
If you plan to use the /search functionality, use the following instead:
curl -X POST https://api.subworkflow.ai/v1/vectorize \
--header 'x-api-key: <YOUR-API-KEY>' \
--form "file=@/path/to/file"
import * as fs from "node:fs";
import { Subworkflow } from "@subworkflow/sdk";
const subworkflow = new Subworkflow({ apiKey: "<MY-API-KEY>" });
// Read the file into memory before handing it to the SDK.
const fileInput = fs.readFileSync("annual_report.pdf");
const dataset = await subworkflow.extract(fileInput, { filename: "annual_report.pdf" });
If you plan to use the /search functionality, use the following instead:
const dataset = await subworkflow.vectorize(fileInput, { filename: "annual_report.pdf" });
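If you call the REST API directly rather than through the SDK, the choice between the two endpoints can be captured in one place. The sketch below only builds a request description and sends nothing. The endpoint paths and the `x-api-key` header come from the examples above; the base URL is assumed from the share URLs shown in this guide, and `buildUploadRequest` and `UploadRequest` are hypothetical names.

```typescript
// Base URL assumed from the share URLs shown in this guide.
const API_BASE = "https://api.subworkflow.ai/v1";

interface UploadRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
}

// Pick /vectorize when the document will be used with /search, /extract otherwise.
function buildUploadRequest(apiKey: string, needsSearch: boolean): UploadRequest {
  const endpoint = needsSearch ? "vectorize" : "extract";
  return {
    method: "POST",
    url: `${API_BASE}/${endpoint}`,
    headers: { "x-api-key": apiKey },
  };
}
```

The request body would still be a multipart form with the document in the `file` field, as in the curl examples.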
Retrieve Your Dataset and Its Items
Once a document is uploaded to Subworkflow, it becomes a Dataset and its pages are referred to as DatasetItems. We'll be using this terminology a lot throughout the documentation.
Retrieving the Dataset (Document)
Requesting the dataset provides a link to a PDF version of the original document and tells you how many pages it contains (itemCount). Typically, you'll fetch the dataset record only for the metadata needed to query over its Dataset Items.
- Curl
- JS/TS
curl https://api.subworkflow.ai/v1/datasets/<datasetId> \
--header 'x-api-key: <YOUR-API-KEY>'
{
"success": true,
"total": 1,
"data": {
"id": "ds_VV08ECeQBQgDoVn6",
"workspaceId": "wks_Gg9Bzi7sx8fbCfWI",
"type": "doc",
"itemCount": 1,
"fileName": "file_AIpNsoTx4OkRNY3H",
"fileExt": "pdf",
"mimeType": "application/pdf",
"fileSize": 136056,
"createdAt": 1761910646651,
"updatedAt": 1761910646651,
"expiresAt": 1761910646651,
"share": {
"url": "https://api.subworkflow.ai/v1/share/dsx_DdTXOgxPh0PLSPhb?token=VkVBNh",
"token": "VkVBNh",
"expiresAt": 1761910891643
}
}
}
import { Subworkflow } from "@subworkflow/sdk";
const subworkflow = new Subworkflow({ apiKey: "<MY-API-KEY>" });
const dataset = await subworkflow.datasets.get('<datasetId>');
{
"id": "ds_VV08ECeQBQgDoVn6",
"workspaceId": "wks_Gg9Bzi7sx8fbCfWI",
"type": "doc",
"itemCount": 1,
"fileName": "file_AIpNsoTx4OkRNY3H",
"fileExt": "pdf",
"mimeType": "application/pdf",
"fileSize": 136056,
"createdAt": 1761910646651,
"updatedAt": 1761910646651,
"expiresAt": 1761910646651,
"share": {
"url": "https://api.subworkflow.ai/v1/share/dsx_DdTXOgxPh0PLSPhb?token=VkVBNh",
"token": "VkVBNh",
"expiresAt": 1761910891643
}
}
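For TypeScript users, the dataset record above can be described with a pair of interfaces. These types are inferred from the sample response only and are not an official schema. The sketch also assumes the timestamps are Unix epoch values in milliseconds, which is what the sample values suggest. `summarizeDataset` is a hypothetical helper.

```typescript
// Types inferred from the sample response in this guide (not an official schema).
interface ShareLink {
  url: string;
  token: string;
  expiresAt: number; // assumed: Unix epoch in milliseconds
}

interface Dataset {
  id: string;
  workspaceId: string;
  type: string;
  itemCount: number; // number of pages in the document
  fileName: string;
  fileExt: string;
  mimeType: string;
  fileSize: number; // assumed: size in bytes
  createdAt: number;
  updatedAt: number;
  expiresAt: number;
  share: ShareLink;
}

// Produce a one-line, human-readable summary of a dataset record.
function summarizeDataset(ds: Dataset): string {
  const sizeKb = (ds.fileSize / 1024).toFixed(1);
  const created = new Date(ds.createdAt).toISOString();
  return `${ds.id}: ${ds.itemCount} page(s), ${sizeKb} KB, ${ds.mimeType}, created ${created}`;
}
```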
Retrieving the Dataset Items (Pages)
In this example, we retrieve the equivalent of the 1st, 3rd and 5th pages of our document in JPEG format. This is particularly powerful when handling large documents (1000+ pages): you don't need to receive the full dataset as you do with other services, just pick out the pages you want! Other cols patterns include the range modifier, where cols=50:100 returns pages 50 to 100. For full details on querying options, please refer to the API reference documentation.
- Curl
- JS/TS
curl "https://api.subworkflow.ai/v1/datasets/<datasetId>/items?row=jpg&cols=1,3,5" \
--header 'x-api-key: <YOUR-API-KEY>'
{
"sort": ["-createdAt"],
"offset": 0,
"limit": 10,
"total": 3,
"data": [
{
"id": "dsx_B5bsOBDzsXsqfmLo",
"col": 1,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.subworkflow.ai/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
"token": "kCThsH",
"expiresAt": 1761911418809
}
},
{
"id": "dsx_1muCWQXZ58r5PsjC",
"col": 3,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.subworkflow.ai/v1/share/dsx_1muCWQXZ58r5PsjC?token=Qqkk7U",
"token": "Qqkk7U",
"expiresAt": 1761911418809
}
},
{
"id": "dsx_0yIaKxZjiZIXc1G3",
"col": 5,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.subworkflow.ai/v1/share/dsx_0yIaKxZjiZIXc1G3?token=7GQKco",
"token": "7GQKco",
"expiresAt": 1761911418809
}
}
]
}
import { Subworkflow } from "@subworkflow/sdk";
const subworkflow = new Subworkflow({ apiKey: "<MY-API-KEY>" });
const datasetItems = await subworkflow.datasets.getItems('<datasetId>', {
  row: 'jpg',
  cols: [1,3,5],
});
// console.log(datasetItems)
[
{
"id": "dsx_B5bsOBDzsXsqfmLo",
"col": 1,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.subworkflow.ai/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
"token": "kCThsH",
"expiresAt": 1761911418809
}
},
{
"id": "dsx_1muCWQXZ58r5PsjC",
"col": 3,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.subworkflow.ai/v1/share/dsx_1muCWQXZ58r5PsjC?token=Qqkk7U",
"token": "Qqkk7U",
"expiresAt": 1761911418809
}
},
{
"id": "dsx_0yIaKxZjiZIXc1G3",
"col": 5,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.subworkflow.ai/v1/share/dsx_0yIaKxZjiZIXc1G3?token=7GQKco",
"token": "7GQKco",
"expiresAt": 1761911418809
}
}
]
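The two `cols` patterns mentioned in this section (a list such as `1,3,5` and a range such as `50:100`) can be generated with a small helper. This is a sketch under assumptions: all names are hypothetical, pages are assumed to be numbered from 1, and ranges are assumed to include both ends. The second function shows how a dataset's `itemCount` could be split into range batches for very large documents.

```typescript
// Either an explicit list of pages or an inclusive range (assumed inclusive).
type Cols = number[] | { from: number; to: number };

// Render the `cols` value: "1,3,5" for a list, "50:100" for a range.
function colsParam(cols: Cols): string {
  return Array.isArray(cols) ? cols.join(",") : `${cols.from}:${cols.to}`;
}

// Build the path and query string for the dataset items endpoint shown above.
function itemsPath(datasetId: string, row: string, cols: Cols): string {
  const query = `row=${encodeURIComponent(row)}&cols=${colsParam(cols)}`;
  return `/v1/datasets/${encodeURIComponent(datasetId)}/items?${query}`;
}

// Split `itemCount` pages into consecutive ranges of at most `batchSize` pages.
function pageBatches(itemCount: number, batchSize: number): { from: number; to: number }[] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: { from: number; to: number }[] = [];
  for (let from = 1; from <= itemCount; from += batchSize) {
    batches.push({ from, to: Math.min(from + batchSize - 1, itemCount) });
  }
  return batches;
}
```

For example, a 1200-page dataset with a batch size of 100 yields twelve ranges, each of which can be requested separately instead of fetching every page at once.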
Use Retrieved Files as Context for LLMs
With each retrieved Dataset Item, Subworkflow will give you a "share URL". This is a URL to a page of your document that is protected by an expiring token. We find this a more secure approach as even if the URL is cached/leaked, the file will be inaccessible after the expiry.
"share": {
"url": "https://api.subworkflow.ai/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
"token": "kCThsH",
"expiresAt": 1761911418809
}
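Because share links expire, it can help to check `expiresAt` before handing a URL to an LLM and to re-fetch the item if the link is stale. A minimal sketch, assuming `expiresAt` is a Unix timestamp in milliseconds (the sample values suggest this). The function name and the 30-second leeway are hypothetical choices.

```typescript
interface ShareLink {
  url: string;
  token: string;
  expiresAt: number; // assumed: Unix epoch in milliseconds
}

// True when the link has expired, or will expire within `leewayMs`.
// The leeway leaves time for the LLM provider to download the file.
function isShareExpired(
  share: ShareLink,
  nowMs: number = Date.now(),
  leewayMs: number = 30000,
): boolean {
  return share.expiresAt - leewayMs <= nowMs;
}
```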
Pass these URLs into your LLM along with your prompt to generate answers or to extract structured output.
- Curl
- JS/TS
curl https://api.openai.com/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"model": "gpt-4.1",
"input": [
{
"role": "user",
"content": [
{"type": "input_text", "text": "what is in this image?"},
{
"type": "input_image",
"image_url": "https://api.subworkflow.ai/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH"
}
]
}
]
}'
import OpenAI from "openai";
const openai = new OpenAI();
const response = await openai.responses.create({
model: "gpt-4.1",
input: [
{
role: "user",
content: [
{ type: "input_text", text: "what is in this image?" },
{
type: "input_image",
image_url: "https://api.subworkflow.ai/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
}
]
}
]
});
console.log(response);
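When you retrieve several pages, the same message shape can carry one image part per page. The sketch below builds the `input` message from a list of Dataset Items, reusing the `input_text` and `input_image` parts from the example above. `buildUserMessage` and the `DatasetItem` type are hypothetical and inferred from the sample responses in this guide. It also assumes the items were requested as images (for example `row=jpg`).

```typescript
// Shape inferred from the sample Dataset Item responses in this guide.
interface DatasetItem {
  id: string;
  col: number; // page number
  row: string; // format, e.g. "jpg"
  share: { url: string; token: string; expiresAt: number };
}

type ContentPart =
  | { type: "input_text"; text: string }
  | { type: "input_image"; image_url: string };

// Build one user message: the prompt first, then one image per page in page order.
function buildUserMessage(prompt: string, items: DatasetItem[]) {
  const pages = [...items].sort((a, b) => a.col - b.col);
  const content: ContentPart[] = [
    { type: "input_text", text: prompt },
    ...pages.map((item): ContentPart => ({ type: "input_image", image_url: item.share.url })),
  ];
  return { role: "user" as const, content };
}
```

The returned object can be placed in the `input` array of the request shown above, for example `input: [buildUserMessage("Summarise these pages", datasetItems)]`.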
Next steps
Congratulations! 🎉 If you've managed to make it this far, you should have some idea of how Subworkflow can help you build powerful and more durable RAG and/or Structured Output applications.
Head on over to the next section on ways to use the Subworkflow APIs in your application and workflows.