Technical features
Our data generation pipeline is agentic: between the user's inputs and the final dataset sits a chain of AI-driven data processing steps.
We take two inputs from a user: a data structure and a description of the data content. For the first week after MVP launch these will be templated, both for budgeting and for testing reasons; shortly after, we will allow users to tune these parameters. As an example, consider our lite product's use case: generating data about legal incidents. The data structure looks as follows:
{
  "scenario": {
    "incident": "String",
    "involvedParties": "String",
    "evidence": "String",
    "previousSimilarCases": "String"
  },
  "parameters": {
    "jurisdiction": "String",
    "relevantLaws": "Array of Strings",
    "legalQuestions": "Array of Strings"
  },
  "enforceableUSLaw": {
    "applicableLaw": "String",
    "rationale": "String"
  }
}
The data content, for this example, is a loose natural-language description of the legal incident to be generated.
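To illustrate how a structure input like the one above could be normalized, here is a minimal sketch, assuming the input arrives as JSON; the function name and internal representation (a flat list of field-path/type pairs) are illustrative, not our actual API:

```python
import json

def flatten_schema(node, prefix=""):
    # Walk the nested structure description and emit (field_path, type) pairs.
    fields = []
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            fields.extend(flatten_schema(value, path))
        else:
            fields.append((path, value))
    return fields

structure = json.loads("""
{
  "scenario": {"incident": "String", "involvedParties": "String"},
  "parameters": {"relevantLaws": "Array of Strings"}
}
""")
print(flatten_schema(structure))
```

A flat representation like this makes it straightforward to accept other input formats (YAML, XML, and so on) by normalizing each into the same internal form before generation begins.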
Our protocol takes these two inputs and processes both. From the first it extracts the structure of the data; while the example above is given in JSON, any data format can be accepted, and it is normalized into our own internal data structure. The second input is passed directly to the first LLM in our chain. Depending on the task, we use a fine-tuned GPT-4 Turbo (the fine-tuning is another reason generation is templated for the first week), Claude 3 for difficult reasoning, or GPT-3.5 Turbo for small tasks.

By combining the preprocessed data structure with the data content input, we start a chain of events that generates rows of unique data. Each generated row is checked against previously generated rows and the starting conditions to ensure uniqueness, and against our internal metrics to ensure data quality. We do this at a scale determined by the end user (capped for the MVP for budgeting reasons), and finally output a fully processed dataset fitted to our user's needs.
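The uniqueness and quality checks can be sketched as a simple rejection loop. In the sketch below, generate_row and passes_quality are hypothetical stand-ins for the LLM call and our internal quality metrics; the fingerprinting approach is one plausible way to deduplicate rows:

```python
import hashlib
import json

def row_fingerprint(row):
    # Canonical JSON (sorted keys) so field order does not affect the hash.
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def generate_dataset(generate_row, passes_quality, target_rows, max_attempts=1000):
    rows, seen = [], set()
    for _ in range(max_attempts):
        if len(rows) >= target_rows:
            break
        row = generate_row()
        fp = row_fingerprint(row)
        if fp in seen:
            continue  # duplicate of a previously accepted row; reject
        if not passes_quality(row):
            continue  # fails quality checks; reject
        seen.add(fp)
        rows.append(row)
    return rows
```

Capping max_attempts (and target_rows) is one way the MVP scale limit could be enforced: generation stops either when the requested number of rows is reached or when the attempt budget is exhausted.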