Types of data:
- Synthetic data
Synthetic data is information that's artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models. Data generated by a computer simulation can be seen as synthetic data.
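As a minimal sketch of using simulated data to validate a mathematical model, the example below generates points from a known line plus Gaussian noise and checks that an ordinary least-squares fit recovers the parameters. The function names (`simulate_linear_data`, `fit_line`) and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def simulate_linear_data(n, slope, intercept, noise_std, seed=0):
    """Generate n synthetic (x, y) pairs from y = slope * x + intercept + noise."""
    rng = random.Random(seed)  # seeded so the synthetic set is reproducible
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The true parameters are known because we generated the data ourselves,
# so the estimator can be checked against them directly.
xs, ys = simulate_linear_data(n=1000, slope=2.0, intercept=1.0, noise_std=0.5)
est_slope, est_intercept = fit_line(xs, ys)
print(f"estimated slope={est_slope:.3f}, intercept={est_intercept:.3f}")
```

Because the generating process is fully specified, any gap between the estimates and the true values can be attributed to the estimator or the noise level rather than to unknown properties of the data.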
Data augmentation, which generates synthetic variants of existing examples, can improve the performance and robustness of NLP models, but using synthetic data also has potential downsides:
- Quality concerns: The quality of the synthetic data generated through data augmentation can be variable, depending on the specific techniques used and the characteristics of the original data set. Poor quality synthetic data can introduce noise and inaccuracies into NLP models.
- Overfitting to synthetic data: If the synthetic data generated through data augmentation is too similar to the original data, NLP models can overfit to the synthetic data and perform poorly on new, unseen data.
- Lack of diversity: Some data augmentation techniques may generate synthetic data that is not diverse enough, resulting in models that are biased or that perform poorly on certain types of data.
- Increased computational requirements: Generating synthetic data can be computationally intensive, particularly when dealing with large data sets, which can increase the time and resources required to train NLP models.
- Ethical concerns: There may be ethical concerns associated with generating synthetic data, particularly if the data is used to train models for applications such as surveillance, predictive policing, or other potentially sensitive areas.
- Reduced interpretability: Synthetic data may be more difficult to interpret than real-world data, which can make it harder to understand how NLP models are making predictions.
- Domain dependence: Data augmentation techniques may not generalize well across different domains or applications, requiring additional domain-specific augmentation techniques.
- Lack of real-world accuracy: Synthetic data is not always an accurate representation of real-world data, so models can overfit to its artifacts and perform poorly when deployed in the real world.
- Bias: Synthetic data may inherit biases from the data used to generate it, which can lead to biased machine learning models.
- Limited application: Synthetic data is not suitable for all applications, especially in cases where the data being modeled has complex and unpredictable real-world variations.
- Validation: Validating synthetic data can be challenging when there is no ground truth or real-world data to compare it against, which leaves uncertainty about how effective the synthetic data is for training machine learning models. Synthetic data may also fail to capture the complexity of real-world data, so results measured on synthetic data can be over-optimistic while performance in deployment is poor. It is therefore important to validate synthetic data carefully, for example with cross-validation or by testing on real-world data when available, to check that it represents the real-world data well enough to train on.
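The validation point above can be sketched as a "train on synthetic, test on real" check. This is an illustrative toy under stated assumptions, not a recommended pipeline: the "real" data here is itself simulated as a stand-in, the synthetic generator is deliberately miscalibrated (classes too clean and too far apart), and the helper names `sample`, `fit_threshold`, and `accuracy` are hypothetical.

```python
import random


def sample(n, mean0, mean1, std, rng):
    """Draw n labeled 1-D points, alternating between class 0 and class 1."""
    data = []
    for i in range(n):
        label = i % 2
        mean = mean1 if label else mean0
        data.append((rng.gauss(mean, std), label))
    return data


def fit_threshold(data):
    """Nearest-centroid classifier in 1-D: the midpoint between class means."""
    xs0 = [x for x, label in data if label == 0]
    xs1 = [x for x, label in data if label == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2


def accuracy(data, threshold):
    """Fraction of points classified correctly (class 1 if x > threshold)."""
    correct = sum((x > threshold) == bool(label) for x, label in data)
    return correct / len(data)


rng = random.Random(42)

# Stand-in for real-world data: overlapping classes (assumed parameters).
real_test = sample(2000, mean0=0.0, mean1=2.0, std=1.0, rng=rng)

# Miscalibrated synthetic data: cleaner and more separated than reality.
synth_train = sample(2000, mean0=0.0, mean1=4.0, std=0.3, rng=rng)
synth_test = sample(2000, mean0=0.0, mean1=4.0, std=0.3, rng=rng)

threshold = fit_threshold(synth_train)
acc_on_synthetic = accuracy(synth_test, threshold)
acc_on_real = accuracy(real_test, threshold)

print(f"accuracy on synthetic hold-out: {acc_on_synthetic:.3f}")
print(f"accuracy on real hold-out:      {acc_on_real:.3f}")
```

In this setup the synthetic hold-out score is noticeably higher than the score on the real stand-in, which is the over-optimism described above. Evaluating only on synthetic data would hide the gap.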
| Title | Description, Information |
|---|---|
| Benchmarking Unsupervised Outlier Detection with Realistic Synthetic Data | Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with various and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus allows for a more meaningful evaluation of detection methods in principle. Nonetheless, there have only been few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty to arrive at a good coverage of different domains with synthetic data. In this work we propose a generic process for the generation of data sets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. This allows both for a good coverage of domains and for helpful interpretations of results. We also describe three instantiations of the generic process that generate outliers with specific characteristics, like local outliers. A benchmark with state-of-the-art detection methods confirms that our generic process is indeed practical. |
| Generative Adversarial Networks for Realistic Synthesis of Hyperspectral Samples | This work addresses the scarcity of annotated hyperspectral data required to train deep neural networks. Especially, we investigate generative adversarial networks and their application to the synthesis of consistent labeled spectra. By training such networks on public datasets, we show that these models are not only able to capture the underlying distribution, but also to generate genuine-looking and physically plausible spectra. Moreover, we experimentally validate that the synthetic samples can be used as an effective data augmentation strategy. We validate our approach on several public hyper-spectral datasets using a variety of deep classifiers. |
- Synthetic data on Wikipedia
- What Is Synthetic Data?, Nvidia blog
- Evaluating Synthetic Data using Machine Learning
- The use of Synthetic Data in Financial Services
- What to do with small set of labeled data and large set of unlabeled data?
- Pseudo-Labeling to deal with small datasets — What, Why & How?
- A Guide to Learning with Limited Labeled Data
- How do I select a subset of a DataFrame?, Pandas documentation
- Indexing, Slicing and Subsetting DataFrames in Python
- 23 Efficient Ways of Subsetting a Pandas DataFrame
| Title | Description, Information |
|---|---|
| static-frame | Immutable and grow-only Pandas-like DataFrames with a more explicit and consistent interface. |
| Snorkel | A system for quickly generating training data with weak supervision |
