Structured data to make ML (machine learning) more precise and accurate in eCommerce.

The Problem

  • Preamble
    "You train a classification model with 10,000 images of shoes. Everything works wonderfully until Sunday: Somebody releases a new pair of sneakers. Your model has never seen them, but now it needs to classify them somehow."
    Santiago Valdarrama - link
  • The Analysis 
    Building ML (machine learning) models is time-consuming, complex, and expensive. The core of any ML model is the client's data, and every client has a unique dataset composed of evolving data, data anomalies, and unique classification rules.
  • The Barrier
    The uniqueness of the data makes ML models very difficult to reuse: every change to the client's dataset can make them obsolete.
    The impossibility of scaling models, combined with their short life, generates prohibitive costs that scare companies away.
  • The Problem
    Given that data is responsible for the barriers to ML models, and that data issues are extremely frequent in eCommerce (causing short-lived, low-quality models):
    How can we build better models, extend their life cycle, and take advantage of scalability?

The Problem - eCommerce Demonstration

We analyzed the top 30 fashion websites, looking for data issues on the most common product we could find: T-shirts.

100% of the sites were missing one or more references to the basic T-shirt attributes.

80.7% of the sites had wrong items in their search results (cataloging errors).

76.9% of the sites had products that didn't appear when we applied the search filters.

The Problem - Demonstration


As we can see, eCommerce data issues are common even among the top 30 websites in the industry.

Data uniqueness forces the development of a unique ML model for every single site: companies end up with custom-built models that have shorter life cycles and cannot be scaled.

Our Solution

Shifting the focus from the company's data to the product's data will remove most of the data uniqueness from ML models, cutting costs and allowing scalability.
A T-shirt will always be a T-shirt, no matter who sells it, and it will always have the same physical properties, independent of the seller's unique data.
For us, the focus needs to be first on the product and only then on the company's data.

Our Solution

We want to build product blueprints with all the product’s options, properties, and variables.

This structured product data will serve as a knowledge base and starting point for data scientists.

Structured product data will remove a large part of the data analysis process, reduce analysis errors, and greatly reduce costs.
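As a sketch, such a blueprint could be modeled as a small data structure. The attribute names and allowed values below are illustrative assumptions, not a finished taxonomy:

```python
from dataclasses import dataclass, field

@dataclass
class AttributeSpec:
    """One attribute in a product blueprint: its name, allowed values, and whether it is required."""
    name: str
    allowed_values: list[str]
    required: bool = True

@dataclass
class ProductBlueprint:
    """Blueprint listing the options, properties, and variables a product type can have."""
    product_type: str
    attributes: list[AttributeSpec] = field(default_factory=list)

    def required_attributes(self) -> list[str]:
        return [a.name for a in self.attributes if a.required]

# Illustrative T-shirt blueprint (attribute names/values are assumptions for the example)
tshirt = ProductBlueprint(
    product_type="t-shirt",
    attributes=[
        AttributeSpec("size", ["XS", "S", "M", "L", "XL"]),
        AttributeSpec("color", ["black", "white", "red", "blue"]),
        AttributeSpec("sleeve_length", ["short", "long"]),
        AttributeSpec("neckline", ["crew", "v-neck"], required=False),
    ],
)

print(tshirt.required_attributes())  # → ['size', 'color', 'sleeve_length']
```

A data scientist could start from a blueprint like this instead of reverse-engineering the attribute space from each client's raw catalog.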

Our Solution Vectors

  • Less Complexity
    By having a map (a structured data schema) of all the properties and options of a product, we can sanitize and organize all data, obtaining a cleaner, more reliable, and faster way to process it.
  • Better Quality
    With the blueprint in hand, we can easily discover and correct missing data and data inconsistencies.
    This will allow faster implementation of AI-based solutions and, for example, better search results.
  • Scalability
    By using structured data schemas, eCommerce data becomes more tangible and enables cross-website solutions. This allows AI solutions to scale.
  • Lower Costs
    With structured data (less complex and of higher quality), AI solution providers can take advantage of the scalability of their processes and significantly lower their prices.
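A minimal sketch of how the complexity and quality vectors could play out in code: checking a raw catalog item against a hypothetical blueprint to surface missing or inconsistent attributes (the field names and values are assumptions):

```python
def validate_item(item: dict, blueprint: dict) -> dict:
    """Check a raw catalog item against a blueprint.

    blueprint maps attribute name -> (allowed values, required flag).
    Returns the missing required attributes and the invalid values found.
    """
    missing, invalid = [], []
    for attr, (allowed, required) in blueprint.items():
        value = item.get(attr)
        if value is None:
            if required:
                missing.append(attr)
        elif value not in allowed:
            invalid.append((attr, value))
    return {"missing": missing, "invalid": invalid}

# Hypothetical T-shirt blueprint and catalog row
TSHIRT_BLUEPRINT = {
    "size": ({"XS", "S", "M", "L", "XL"}, True),
    "color": ({"black", "white", "red", "blue"}, True),
    "sleeve_length": ({"short", "long"}, True),
}

item = {"size": "M", "color": "crimson"}  # missing sleeve_length, unknown color
report = validate_item(item, TSHIRT_BLUEPRINT)
print(report)  # → {'missing': ['sleeve_length'], 'invalid': [('color', 'crimson')]}
```

Every flagged gap here is one less cataloging error reaching the search index, which is exactly the class of issue found on the top 30 sites.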

Our Solution - Benefits

Example of benefits in the voice sector (digital assistants, chatbots, IVRs):
  • Automatic Dialog Generation

    By having access to a structured data schema, it is simple to generate product descriptions and answers to questions.

  • Better search experience

    In the voice space, users tend to ignore the limitations of search filters. With a structured data schema, it is simpler to understand the user's request, allowing correct answers instead of the typical "I don't know that one".

  • Faster Development and Maintenance

    By using the blueprints, it is faster and simpler to build or reuse an agent. Maintenance costs are also lower because there is no need for custom development.
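The automatic dialog generation idea can be sketched as simple template filling over a structured item. This is a toy illustration; the templates and attribute names are assumptions, not a production approach:

```python
def describe(item: dict, order: list[str]) -> str:
    """Generate a one-line product description from structured attributes."""
    parts = [item[attr] for attr in order if attr in item]
    return f"A {' '.join(parts)} {item['product_type']}."

def answer(item: dict, question_attr: str) -> str:
    """Answer an attribute question instead of a generic 'I don't know'."""
    value = item.get(question_attr)
    if value is None:
        return f"Sorry, I don't have the {question_attr} for this product."
    return f"The {question_attr} is {value}."

item = {"product_type": "t-shirt", "color": "black", "sleeve_length": "short-sleeve"}
print(describe(item, ["color", "sleeve_length"]))  # → A black short-sleeve t-shirt.
print(answer(item, "color"))                       # → The color is black.
```

Because the agent reads from the schema rather than from free text, the same dialog code can be reused for any product type covered by a blueprint.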

Our Solution - How we do it

Selecting and Curating

We select and curate the best source data to build the structured data.

AI Process

We use proprietary NLP models to process the curated information.

We use a mix of GPTs, vectorization, and clustering to generate the structured data schemas.
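The actual pipeline relies on proprietary NLP models and GPTs; as an illustrative stand-in for the vectorization-and-clustering step, here is a stdlib-only sketch that groups raw attribute strings by string similarity, so that spelling variants of the same value land in the same candidate schema entry:

```python
# Illustrative stand-in: greedily cluster raw attribute strings from scraped
# catalogs by similarity, so variants like "tshirt" / "t-shirt" group together.
from difflib import SequenceMatcher

def cluster_values(values: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Greedy clustering: each value joins the first cluster whose
    representative (first member) is similar enough, else starts a new one."""
    clusters: list[list[str]] = []
    for value in values:
        for group in clusters:
            if SequenceMatcher(None, value, group[0]).ratio() >= threshold:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters

raw = ["t-shirt", "tshirt", "tee shirt", "hoodie", "hoody", "denim jeans"]
print(cluster_values(raw))
# → [['t-shirt', 'tshirt', 'tee shirt'], ['hoodie', 'hoody'], ['denim jeans']]
```

In practice, embedding-based vectorization replaces the character-level similarity used here, but the principle is the same: collapse the noisy value space into a clean set of schema values.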

About

We are focused on building structured data and knowledge bases for the development of better ML models.

Our work can be used widely across the eCommerce ecosystem. Structured data schemas will enable everything from better product recommendations to the development of self-learning chatbots and digital assistants.