We invite ML practitioners to share their experiences with the Evidently Community during the Ask-Me-Anything series.
Our latest guest is Fabiana Clemente, Co-founder and Chief Product & Data Officer of YData, a platform that improves the quality of training datasets.
We chatted about synthetic data, its quality, beginner mistakes in data generation, the data-centric approach, and how well companies are doing in getting there.
Sound interesting? Read on for a recap of the AMA with Fabiana.
Synthetic data adoption
Is the synthetic data process (for example, for creating or augmenting training data) beneficial for small companies and individuals, or is it something only big companies can use now?
Great question! I would say it doesn't depend so much on the size of the company but rather on the use case and data available. I think that is what really defines whether synthetic data is the right investment.
For instance, you can be a small company that wants to work with healthcare. Synthetic data is probably the right vehicle to ease access.
Do you see a trend that some industries are faster in adopting synthetic data than others (e.g., healthcare or finance, as they deal with sensitive data)?
I do, indeed. Healthcare is one of the areas where we see the most active research in the synthetic data space: from data sharing to serving as an auxiliary tool for Federated Learning. But when it comes to day-to-day usage, I think financial services and insurance are the industries most actively looking into adopting synthetic data.
Data-centric approach to ML
"Garbage in, garbage out" seems to be common knowledge in DS and ML. But what does it really mean to be data-centric?
In my perspective, Data-Centric AI revives an old concept and key paradigm of data science development — that data is core, and the better your data, the better your outcome.
But this new paradigm brings something more holistic: the model is the artifact to be kept fixed. Instead, the data is what should be iterated on, improved, and monitored throughout the AI lifecycle.
What is the role of synthetic data in Data-Centric AI?
Synthetic data is definitely an important tool to have in your data science toolkit. Even Gartner says so.
In my perspective, it is a very versatile tool: from easing data access to balancing datasets, mitigating bias, and conditionally generating records, you name it. In a nutshell, it can mitigate many of the problems and challenges you find when preparing a dataset for a model, making it core to the Data-Centric paradigm.
Some business stakeholders in the enterprise expect "algorithms" to figure things out automatically, even from "bad" data. They are not open to investing in better quality data, labeling, data quality control, etc. Now that Data-Centric AI is a thing, do you see it changing in the enterprise? Or do we still have to educate the business?
I think there's much education to be done at the business level. Many processes need to be changed while working with data, and some start at the level of the company culture in matters related to data. Nevertheless, I think Data-Centric AI is bringing a much-needed new perspective on the subject and creating momentum at the business level, which I believe will contribute to making the processes easier.
Many organizations are now looking into data quality as a cost reduction opportunity instead of a "nice-to-have" :)
Tools to generate synthetic data
What kinds of tools do you see in the open-source space for synthetic data creation? Do we have good tools available, or is there an imbalance where most of the best-in-class tools appear to be closed-source?
There are some interesting ones worth checking out. I'll highlight two, but there are a few others:
- ydata-synthetic (focused on the use of Deep Learning models for synthesis)
- SDV (which encompasses many techniques, from deep learning to Bayesian networks)
In the case of imbalanced classes, from what I've seen in the open-source space, no single package is best in class.
Depending on the type of data we are talking about (images, text, or structured), the techniques are widely different, and even in the realm of structured data, it depends on the use case and data structure.
For some use cases, SMOTE and ADASYN (from the imbalanced-learn package) are enough to deal with the class imbalance problem. Still, GANs are a better option when you have higher dimensionality and complex relations. GANs can be challenging, though, due to the many parameters to tune.
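To make the interpolation idea behind SMOTE concrete, here is a minimal, purely illustrative sketch using only NumPy — the function name and toy data are made up for this example, and real projects should use the tested `SMOTE` class from imbalanced-learn:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between a random sample and one of its k nearest minority
    neighbors — the core idea behind SMOTE (illustrative only)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Toy 2D minority class, oversampled with 20 extra points
X_minority = np.array([[0.0, 0.0], [1.0, 0.1], [0.9, 1.0],
                       [0.1, 0.9], [0.5, 0.5]])
X_new = smote_like_oversample(X_minority, n_new=20, k=2, rng=42)
print(X_new.shape)  # (20, 2)
```

Because each synthetic point is a convex combination of two real minority points, the new samples always stay inside the minority class's convex hull — which is also why plain SMOTE struggles with very high-dimensional, complex data.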
Is there some kind of spectrum of difficulty for synthetic data types? I.e., structured data is easier than unstructured data? I ask since it seems that tool-builders often focus on structured data first, and it feels like the unstructured types get a little neglected, at least early on.
For synthetic data, I would say it is the opposite. The space of unstructured data (from videos to images) is very well developed, with a strong community both in open source and among platforms.
For structured data, we have seen bigger investment since 2017, but mainly focused on privacy (research). However, synthetic data can also be leveraged for use cases such as data augmentation, balancing datasets, or bias mitigation.
Are there baseline models for synthetic data? Is it pretty much VAE and GANs? Has something emerged as the go-to method?
VAEs and GANs are one way to do it. But there are other methods that are not Deep Learning based, such as Markov chains or Bayesian networks, for instance. In a nutshell, any generative model can be used as a tool for data synthesis.
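As a toy illustration of a non-deep-learning generative model, here is a first-order Markov chain fitted on a small categorical sequence and then sampled to produce a synthetic one. The states and helper functions are invented for the example:

```python
import random
from collections import defaultdict

def fit_markov_chain(sequence):
    """Record every observed transition state -> next state."""
    transitions = defaultdict(list)
    for current, nxt in zip(sequence, sequence[1:]):
        transitions[current].append(nxt)
    return transitions

def sample_markov_chain(transitions, start, length, seed=None):
    """Generate a synthetic sequence by walking the fitted chain,
    picking each next state with its empirical frequency."""
    rng = random.Random(seed)
    state, out = start, [start]
    for _ in range(length - 1):
        state = rng.choice(transitions[state])
        out.append(state)
    return out

# Toy categorical sequence (e.g., daily weather states)
history = ["sun", "sun", "rain", "sun", "rain", "rain", "sun", "sun"]
chain = fit_markov_chain(history)
synthetic = sample_markov_chain(chain, start="sun", length=10, seed=7)
print(synthetic)
```

The generated sequence preserves the transition frequencies of the original data while producing a new, never-observed ordering — the defining property of any generative approach to synthesis.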
We now have methods such as GPT-3 and DALL-E as industrial solutions to generate text and images. What would you say are the next things we can expect to see in the field of "synthetic data as a service"?
This is a tough one! To be honest, I think we will still see more advances within the realm of video and images. Structured data is trickier to generalize and deliver as an automated service with no business background embedded.
I was wondering if there are approaches to synthetic data for time series. I have tried variational autoencoders and other generative models. However, the results looked "too smooth," while the real time series were "spiky." It didn't seem that those approaches would be usable in practice. Are there libraries or approaches that work well for sequence data?
Indeed, that's a common pattern you will find when generating synthetic time series. Solutions based on RNNs and CNNs tend to have limitations and lose long-term context while generating time series, resulting in the behavior you've described.
From an open-source perspective, I haven't seen anything that works smoothly, but the best so far is DoppelGANger, which is GAN-based. By the way, we are working on making it available with TF2 in the ydata-synthetic package.
Challenges of synthetic data
Synthetic data seems to be an answer to improving the quality of training datasets. But how do you ensure the quality of synthetic data? How do you know it's good enough to use? Does the approach to monitoring differ somehow for synthetic data?
Talking particularly about structured data (my domain of expertise), we categorize the quality of synthetic data into three main pillars: fidelity, utility, and privacy.
Fidelity is the statistical similarity and coherence between the real and synthetic data.
Utility measures whether the generated data could be used for a downstream application such as queries or ML models.
Lastly, privacy measures whether your synthetic data is just a copy of, or has memorized, the original data.
Each of these pillars comprises specific metrics that help us understand how the synthetic data behaves. The quality, overall, will depend on the use case. Because in the end, there's always a trade-off between these three.
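As a rough illustration of the fidelity and privacy pillars (these are naive stand-ins, not YData's actual metrics), one could sketch two checks in NumPy: a fidelity gap comparing column means and correlation matrices, and a "copy rate" that flags synthetic rows memorized from the real data:

```python
import numpy as np

def fidelity_score(real, synth):
    """Naive fidelity check: how far the synthetic data's column
    means and correlation matrix drift from the real data's."""
    mean_gap = np.abs(real.mean(axis=0) - synth.mean(axis=0)).mean()
    corr_gap = np.abs(np.corrcoef(real, rowvar=False)
                      - np.corrcoef(synth, rowvar=False)).mean()
    return mean_gap, corr_gap  # lower is better

def copy_rate(real, synth):
    """Naive privacy check: fraction of synthetic rows that are
    exact copies of a real row (a memorization indicator)."""
    real_rows = {tuple(row) for row in np.round(real, 6)}
    copies = sum(tuple(row) in real_rows for row in np.round(synth, 6))
    return copies / len(synth)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
synth = rng.normal(size=(500, 3))  # stand-in for generated data
print(fidelity_score(real, synth))
print(copy_rate(real, synth))      # 0.0 — no memorized rows
```

The trade-off mentioned above is visible even here: a generator that simply replays real rows scores perfectly on fidelity but fails the copy-rate check, while heavy noise does the reverse.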
What are some misconceptions about synthetic data?
That synthetic data can be fully private and, at the same time, keep the same utility and fidelity as the original dataset.
Another one is that synthetic data is not use case dependent or shouldn't be optimized for a particular downstream application.
How do people go wrong when generating synthetic data? What are the rookie mistakes that a beginner should try to avoid?
Interesting question! I think one is expecting the exact same distribution (including the outliers), which needs to be handled properly depending on whether the synthetic data is used for augmentation or for privacy.
Another is expecting a one-size-fits-all synthetic dataset: this is probably one of the most challenging bits of synthetic data. It is versatile, but its quality is still highly influenced by the knowledge and inputs of the expert or data scientist.
We constantly see the "threat of deep fakes" hyped in the press. However, the output of generative models still looks very "synthetic." How far do you think we are from generating data humans cannot distinguish? Is it still happening this decade?
The investment in synthetic data generation is not at its peak yet, so I think we have a good chance to achieve realistic results within the current decade.
We have already seen some promising results from some models, so I do believe this is possible to happen soon!
In your opinion, what is the role of communities in developing modern ML tools? What was your core motivation when creating the Data-Centric AI community?
Definitely core! In my opinion, communities are where interesting discussions happen, and users are comfortable sharing their ideas. It creates a safe place to learn and a source of knowledge for practitioners.
It was the right vehicle for us to create meaningful discussions around the topic, support others in learning more, provide access to the right open-source projects, etc. With a stronger community around Data-Centric AI, I think supporting organizations in adopting the paradigm will be easier.
* The discussion was lightly edited for better readability.
Want to join the next AMA session?
Join our Discord community! Connect with maintainers, ask questions, and join AMAs with ML experts.
Join community ⟶