![18-prep-4-violins[pce19.12]](https://infoserviceblog.wordpress.com/wp-content/uploads/2019/01/18-prep-4-violinspce19.12.jpg?w=584)
Data mining / exploration and discovery is a mesmerizing part of data analysis. The tedious grunt work of preparing the data for analysis is not such an exciting topic. Given the resources organizations dedicate to data preparation, estimated at half to three quarters of all data processing expenditures, it seems only reasonable to devote an appropriate amount of energy to continued written clarification of the subject.
It makes sense to frame the discourse of this paper through a definition of the topic itself; all too often, the term “preparation” is conflated with other terms, which ultimately increases confusion on the subject. In reality, definitions evolve over time, and what was considered part of [x] in earlier years is often described as something different in the next generation.
Current definitions of data prep vary depending on whom you ask, but the fact that the activity is of utmost importance is not in question. Forrester author Michele Goetz sums it up nicely, declaring “New big data environments, faster data integration, and analytic appliances aren’t the answer [to today’s data analysis challenges]. Your analysts need better tools to speed up data preparation efforts…” [Vendor Landscape: Data Prep Tools, 2016]. One need only look at the volume of venture capital afforded to this space to appreciate the market value. Among the notable private firms with under 10,000 employees operating in the space, a whopping $100 million in VC funding has been allocated [on average] to each vendor since 2016.
While searching for a definition, this paper will first skim important concepts closely related to preparation, helping to refine the notion of what prep is by also looking at what it is not. Second, we will analyze some weaknesses in commercial reports on preparation software; and in the last chapter, we will manipulate some published data on commercial prep software and create summary visualizations of the analysis.
Table of Contents
| Chapter 1 | Definitions | Kind of boring |
| Chapter 2 | Describing analysis of commercial analysts’ reports, via text | Painfully boring |
| Chapter 3 | Visualizing analysis and synthetic unions of analysts’ reports | Fascinating |
Chapter 1, Defining Preparation
So what exactly is data prep? The following examination uses the most credible of the published analyst reports and vendor white papers [which are not all that different] as secondary resources for answering this question. [Ultimately, it seems to boil down to Ovum’s statement on page 5].
In Bloor’s Self-Service Data Preparation Spotlight Paper, author Phillip Howard provides domain-specific jargon for what he believes are key components of the larger data prep concept [Howard, 2016]:
Blending: this is combining datasets.
Munging: this is transformation.
Shaping: this is data preparation.
Wrangling: this is the combination of prep and transformation.
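To make Howard’s terms concrete, here is a minimal, hypothetical pandas sketch; the column names and data are invented for illustration, not taken from any of the reports:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "birth_year": [1980, 1975]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "amount": ["$10", "$25", "$7"]})

# Blending: combining datasets.
blended = orders.merge(customers, on="cust_id", how="left")

# Munging / transformation: reshaping raw values into usable form.
blended["amount"] = blended["amount"].str.lstrip("$").astype(float)

# Wrangling, per Howard, would be the combination of steps
# like these, end to end, as one prep workflow.
print(blended)
```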
Dave Wells lays out a list of prep functions in Eckerson Group’s Data Prep Buyer’s Guide. Looking at the list of functions as a whole can give us a description of prep, which might serve as the equivalent of a definition:
Visual exploration: the critical first step. Exploration includes a summary of the salient characteristics of a dataset. Pattern discovery is part of visual exploration; patterns can be as basic as the statistical distribution of values, or more complex, such as clusters of attributes based on similarity. Patterns are a man’s best friend, but that is a concept for another book. Note that visual exploration is aka discovery.
Transformation: for Wells, this function is composed of improvement, enrichment, and formatting. Wells defines improvement as standardizing field values and conforming data to common formats. Enrichment is defined as derivation and appending. Examples are “calculating age based on birthdate” and extending a street address to include a longitudinal geocode. Formatting is defined as sorting records, sequencing fields, masking, and finalizing records for output.
Cleansing: this function improves data quality. Defective data can be removed or replaced “with derived, default, or most-probable values.”
Blending: the combination of data from multiple sources.
Lineage tracking: related to governance, and to confidence and trust.
Metadata and cataloging. Catalogs collect metadata, and that metadata aids in many aspects of operations including lineage tracking, governance, and search.
Data modeling: the capability of this function extends only to inferring whether a data model is canonical or semantic.
Integration and interoperability. There is a problem of having to load data from access tools to discovery tools to prep tools to analysis tools and finally to reporting tools in order to complete processing [Wells, 2017]. The problem is essentially cost.
Rounding out the list are: sharing [and reuse]; user-defined functions; and scalability and reliability.

Most people are apparently fans of the use of visuals to simplify descriptions of a subject. In trying to picture concepts like data preparation, however, diagrams and visualizations almost always fall short. In the case of data preparation, the steps of a workflow do not always progress sequentially. This can be a problem for simplification [and thus definition], as Jane may describe the steps as 1, 2, 3; but John may say the steps are 2, 4, 3, 4, 4, 5.
In Data Prep is not an Afterthought, Gartner author Lakshmi Randall rightly diagrams data preparation as an iterative process, with a workflow that repeatedly loops back to its own earlier ‘steps.’

Randall goes on to say that exploration of the data and identification of the transformations needed are parts of data prep. In the ensuing bullet points of her report, Randall points to profiling, cataloging, statistic computations, metadata, and transformation as key prep techniques [Randall, 2014]. Taking a different view of workflows, Kirk Borne describes prep as step two in data science [Inside Analysis, 2014], with transformations external to data preparation, placing him in disagreement over how closely the activity of transformation fits with prep.
Perhaps it would be good to frame the concept inclusively, in a set theory kind of way, as opposed to exclusively. The following Venn-style diagram, created to summarize a data integration text document, presents transformation [in green] as a segment of both data quality functions [encapsulated within the yellow border] and core data integration functions [encapsulated within the blue border]. Note, however, that in this diagram transformation has a 2-point relationship with blue, while only a 1-point relationship with yellow. We could also say data prep would in fact be very similar to the range of yellow components in the diagram below.

Whether x comes before y in a workflow, or whether they should be viewed as separate pieces in a diagram, it might be conceptually more productive to think of these entities simply as process objectives. One objective is collection of the data. Another objective is understanding the condition of the data, which often involves building a catalog for user reference. Another objective is to improve the data, which can involve cleansing, restructuring, normalizing, and integrating data [aka blending]. [Many professionals consider data modeling as categorically part of this objective as well, although there can be significant lag times between the preparation and the modeling].
A fourth objective is the stewardship and governance of the data and workflow. This objective has a close relationship with access control rules and security policies. Another objective is to maintain the knowledge about the data and the analysis of particular workflows, so that other users can reuse the knowledge instead of having to figure out for themselves what someone else already knows about the data or has already done in terms of analysis [this is aka ‘sharing’ of fully developed queries, etc]. The objective of ‘sharing’ takes us back to the Catalog and its twin sister, the metadata repository.
Catalogs and metadata repositories are hellishly expensive to build and maintain unless the functions are automated. The value from catalogs and metadata comes back in the forms of improved retrieval of assets; improved proof of adherence to governance policies; and improved identification of data quality issues.
Chapter 2, Analyzing Commercial Reports
As noted in previous articles, the charts in MQs and Waves are over-generalizations of product capabilities. MQs are less informative than some other reports for product specifics, in comparison to Waves, for example, which provide quantitative scoring and taxonomic hierarchies as part of their reports. In any case, we want to extract additional value from what is in these and other documents. Starting with 2016, we will work through 2017 and 2018.
2016
A rather detailed analysis of Gartner’s 2016 Market Guide for Self-Service Data Prep has already been completed here. That [4DI] analysis looks at over 20 vendors along eight different capability dimensions; but in reality, the Gartner Guide is a very BI-centric report. Because the BI product field is much larger than the data prep field in terms of vendors, the Gartner Guide arguably includes too many vendors that are not actually focused on data prep workflows.
One of the takeaways from the 4DI analysis of the 2016 guide is that the capability dimensions of transformation, tool integration, use of ML, and metadata and cataloging were most salient. [A diagram of said dimensions is in Appendix A].
For the 2016 section of this chapter, we can map another resource to that prior work: Bloor’s 2016 SS Prep and Cataloging report. However, due to various forces, the truly valuable results of the analysis of Bloor’s report deviate a bit too much from the main goal of this paper, so that content has been relegated to Appendix B.
2017
Bloor’s 2016 report and Forrester’s 2017 Prep Wave report were released 13 months apart. Some of the text on Datawatch [DWCH] is literally identical, which makes it obvious how much these reports rely on information fed to them by the vendors. Forrester covers only seven products in the 2017 Prep Wave [Little, 2017], all of which they describe as stand-alone[s], not part of a BI tool. The pecking order on the blue Wave diagram is clear: my ranking would be weighted heavily on the “Current Offering” axis [which = product capabilities], resulting in: Paxata, Trifacta, Unifi, Alteryx, SAS, Datawatch, and Oracle, in that order.
The capability dimensions in Forrester’s score breakdown table are: discover and blend, standardize and enrich, transform, deliver, and share. These dimension terms are not used again in any clarifying way in the 2017 Wave document; interested readers have to go back 13 months and locate Vendor Landscape: Data Prep Tools, authored by Goetz, to get a somewhat corresponding picture of the underlying taxonomy. Goetz’s Landscape doc wholly fails to clarify how the Transform dimension relates to the other dimensions, but one good point it does raise is how tool selection must depend on use case.
With the exception of Unifi, scores for Deliver and Share look the same. Note that Oracle is excluded from the score-breakdown table because they did not participate. The table below is a summary of the text content of the 2017 Prep Wave.
| How Forrester ranked them | Strength | Weakness |
| Trifacta | ML | Search and collaboration |
| Paxata | ML, semantic analysis, bidirectional integration w BI tools. | Search and connector optimization |
| Alteryx | Workflow interface | Search, ML |
| Datawatch | Macros | ML |
| SAS Data Loader for Hadoop | Connectors, macros | Search, ML, and collaborate features |
| Unifi | ML, NLP Search | |
| Oracle Big Prep Cloud | ML driven recommender | |
2018
Ovum
Ovum provides a report in 2018 that is helpful for refining descriptions, and potentially also for triangulating the evaluation accuracy of other reports. In Selecting a SS Data Prep Solution [Bartley, 2018], Ovum declares: “‘Data prep’ is no longer a clean, discrete market; it is overlapping, messy, and deeply intertwined with the information governance market.”
Ovum assesses 2018 prep products in seven technology dimensions:
Integration and exploration [connecting to sources, and initial exploration and profiling of data].
Manipulation [core prep: cleansing, blending, transformation, enrichment, and modeling].
UX and UI
Data output and analytics [export, and native analysis].
Collaboration and ML [ML powered functionality for user collaboration].
Administration [architecture, deployment, and processing].
Data governance [metadata and catalog management, and security of data]. Bartley notes early on that a key product capability that must be available for a product to be included in the report is cataloging and/or a metadata repository.
Bartley points out that the governance and the collaboration-and-ML capabilities typically set the top products apart; in contrast, product scores in the administration, integration and exploration, and manipulation dimensions are all closely grouped, suggesting “relative market maturity” in these areas.
The current trends in the overall market that are of most interest, IMHO: “Differentiators include machine learning guided functionality, connectivity to analytics tools, and governance features.” The report also says “SS data prep, traditionally served as a feeder to the SS analytics ecosystem… [is] increasingly feeding prepped data into ML models.” The current fascination with ML absolutely dictates that we facilitate clarity on what exactly ML does for us in these data prep tools.
Bartley lists the following: ML can power detection of workflow actions; detection of outliers in data; detection and ingest of files including files classified as sensitive; automatic deduplication; and recommendations for joining data and sources. These can all be rolled up into ‘automated suggestion / guidance functionalities.’
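As a loose illustration of two of these rolled-up functions, here is a minimal pandas sketch of duplicate and outlier flagging; real prep tools use learned models, and the data and column names here are invented:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ann", "Bob", "Cal"],
                   "amount": [10.0, 10.0, 12.0, 9000.0]})

# Automatic deduplication: flag exact duplicate records.
df["is_dup"] = df.duplicated()

# Outlier detection: flag values outside the 1.5 * IQR fences.
q1, q3 = df["amount"].quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
df["is_outlier"] = (df["amount"] < q1 - fence) | (df["amount"] > q3 + fence)

print(df)
```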
Wells lists the following: in cleansing, “machine learning and pattern mining are helpful [functions] to determine most probable values” based on related or surrounding data; ML can detect deficiencies and recommend actions to improve quality; and ML can recommend blending techniques.
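A minimal sketch of the “most probable values” idea, where the stand-in “model” is simply the most frequent value among related records [the city/state data and the group-mode rule are my invention, not Wells’s]:

```python
import pandas as pd

df = pd.DataFrame({"city":  ["Reston", "Reston", "Reston", "Tysons"],
                   "state": ["VA", "VA", None, "VA"]})

# Fill a missing state with the most probable value given the city,
# i.e., the most frequent (mode) state among records with that city.
df["state"] = df.groupby("city")["state"].transform(
    lambda s: s.fillna(s.mode().iloc[0]))
print(df)
```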
ML can actually “increase the degree to which data prep becomes a collaborative effort [because] as each analyst” does their thing, their actions and results become part of a shared KB. This doesn’t sound like ML per se, but if we look at it as part of a system which can make inferences about data and make suggestions to users, then perhaps.
I think Howard better generalizes ML: prep processing activities are food for ML engines that can make future prep activities easier, perhaps by predicting how you will want to process new data from a new source, potentially even ranking a list of options.
Forrester
Forrester’s 2018 Data Preparation Solutions Wave [Little, 2018] also scores products on seven dimensions, which understandably are not the same dimensions as Ovum’s. However, Forrester’s 2018 dimensions are also completely different from those in the 2017 Data Preparation Tools Wave written by the same author, so we are not able to compare dimensions over consecutive years or across time. We are at least able to cluster products more closely with like products for apples-to-apples comparisons at two points in time, which we will now get to in Chapter 3.
Chapter 3, Data Exploration and Interpolation
Looking at the 2018 Prep Wave score table, we are given seven product capability dimensions: metadata, ML, collaboration, data activation, UX, data, and governance; although Forrester does not do as good a job as Ovum of describing what is or is not part of a dimension.
As you may know, Waves include a scoring table with details on each dimension [see the lower left table in the Calc Sheet below]. To break down the report, I performed a 7-dimension differentiation analysis using the same distance-from-average method as in the 2016 Prep 4DI, but not much of anything insightful came of it [lower right circle].

Calculation Sheet 1
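For readers who want to reproduce the differentiation analysis, here is a minimal sketch of the distance-from-average method, assuming a products-by-dimensions score table like the Wave’s [the scores below are invented]:

```python
import pandas as pd

scores = pd.DataFrame(
    {"metadata": [3.0, 5.0, 1.0], "ml": [5.0, 3.0, 1.0]},
    index=["ProductA", "ProductB", "ProductC"])

# Distance from average: subtract each dimension's mean score;
# positive values mean above average on that dimension.
dist = scores - scores.mean()
print(dist)
```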
I then looked at all possible charting options in Microsoft Excel, but none of them arranged the data in any interesting way either. So I tried violin charts, using one violin for each product capability dimension. A violin chart is one of a handful of chart types that can be useful for depicting distributions.

To assign product positions in the visualization, I used the distance-from-average scores from the upper right table of the Calc Sheet. Each violin sits on a graph that is graduated from top to bottom, with the top third ranging from +2 through zero, the middle third covering zero to -2, and the bottom third covering -2 to -4. All products from the 2018 Wave are then assigned vertical positions on the graph based on scores from columns T through AC. Left-to-right positioning within a violin is simply alphabetical ordering.
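A minimal matplotlib sketch of the construction, one violin per capability dimension; the distance-from-average values below are invented stand-ins for the Calc Sheet columns:

```python
import matplotlib.pyplot as plt

dims = {"ML":         [1.5, 0.5, -0.5, -2.0, -3.5],
        "Governance": [0.5, 0.2, 0.0, -0.3, -0.6]}

# One violin per dimension; each violin's shape depicts the
# distribution of product scores on that dimension.
fig, ax = plt.subplots()
ax.violinplot(list(dims.values()), showmedians=True)
ax.set_xticks(range(1, len(dims) + 1))
ax.set_xticklabels(dims.keys())
ax.set_ylabel("distance from average score")
plt.show()
```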
Some products had noticeable patterns. Datameer for example is represented in the chart below as the shaded boxes, across a set of five different capability dimensions. We can easily see that Datameer scores either at the top of the class [above average] or at the bottom of the class [below average], but not in the middle on these dimensions.

In contrast, the product[s] from DWCH, represented in the chart below as the red boxes, are consistently found in the middle, average in comparison to other products, never appearing at the top or the bottom of these dimensions.
![dwch-violins[pce19.8]](https://infoserviceblog.wordpress.com/wp-content/uploads/2019/01/dwch-violinspce19.8.jpg?w=584)
So one question pops up: are we seeing two different product development philosophies, where Datameer is still aiming for best-of-breed in niche functions, and DWCH is a more mature, or well-rounded, product line? Well, the answer to that is an obvious yes, but we wouldn’t need violins to tell us that if we knew much about them beforehand.
I can say that the violin visualizations have been the only visual exploration technique that showed promise for discovering something latent in the 2018 Prep Wave data. If nothing else, these violins expose clusters of products within the larger set.
We can also identify clusters of clusters amidst the set of violins. The ML, Collaboration, and UX violins [bottom row] have multiple similarities. The Activation and Governance violins [upper row] have different similarities. So we end up with two distinct clusters of violins; the Data and Metadata dimension violins fall outside of those two clusters.
Additional observations should rightfully be explained in more detail by showing a series of visualizations, but given the length of time it takes to construct these visualizations manually, it seems better for readers if we just skip to the next section. Before moving on, though, I would like to point out that I did see an indication that Clearstory could be a bit of an outlier in comparison to other products. More on that later.
If we attempt to merge what we can from Ovum 2018 and Forrester 2018, we can see that we do get some high level agreement on what dimensions matter most in data prep.
| Forrester | Ovum |
| Metadata | [listed as part of governance] |
| Governance | Governance [includes metadata] |
| UX | UX |
| Machine Learning | Collaboration and Machine Learning |
| Collaboration | Collaboration and Machine Learning |
| Data Activation | Data Output and Analytics |
| Data [meaning?] | n/a |
| n/a | Integration and Exploration |
| n/a | Manipulation |
| n/a | Administration [seems to half belong in the other axis also] |
Personally, it seems better to keep collaboration and ML separated as two different dimensions, although the two violins in the previous DWCH diagram are very similar and could in theory be combined. It also seems better to keep metadata and governance separated; so I am left unsatisfied with Ovum’s segmentation of dimensions, and I am also unsatisfied with Forrester’s lack of explication of what their dimension labels stand for. Perhaps we can build some sort of crosswalk over the two to create an analysis that adds value. First, Ovum’s visualization of the top three products measured on their seven dimensions:
![2018-prep-7-dim-top-3-tech[pce19.10]](https://infoserviceblog.wordpress.com/wp-content/uploads/2019/01/2018-prep-7-dim-top-3-techpce19.10.jpg?w=584)
We can decently estimate the rest of Ovum’s scoring on products not shown here by taking measurements from the spider charts located in each vendor’s section of the report.
To normalize to equivalent scales we can simply double the Wave scores and then average them with the Ovum scores, as sketched below:
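[A minimal sketch of that normalization, assuming Wave scores on a 0-5 scale and Ovum scores on a 0-10 scale; the values below are invented, not taken from either report]:

```python
import pandas as pd

wave = pd.Series({"Datameer": 3.2, "Trifacta": 4.1})  # 0-5 scale
ovum = pd.Series({"Datameer": 7.0, "Trifacta": 8.6})  # 0-10 scale

# Double the Wave scores to put both sources on a 0-10 scale,
# then average the two reports' scores per product.
combined = (wave * 2 + ovum) / 2
print(combined)
```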

While Forrester considers Paxata a true prep solution, Ovum and Bloor do not. Gartner, meanwhile, has classified both AYX and Paxata as stand-alone prep vendors, but as previously noted, Gartner is rather liberal in its inclusion scheme. Note also that Goetz’s ‘spreadsheet’ workflow vendors are the same ones covered in Ovum’s report, and the other types of workflow vendors, including AYX, are excluded.
As a result of filtering on the compatible dimensions between the two 2018 reports, we are left with six products and six dimensions [which compress into four], with normalized product scores in each dimension. The products sit in basically the same order in every violin, except for Datameer, which is positioned at 9.0 or higher in two dimensions and 5.2 or lower in the other two.
The first letter of each vendor name identifies their respective node in the diagram, with the exception of DWCH, which is represented with a W.
![18-prep-4-violins[pce19.12]](https://infoserviceblog.wordpress.com/wp-content/uploads/2019/01/18-prep-4-violinspce19.12-1.jpg?w=584)
Note that these four violins are not built from distance-from-average scores, as the previous five violins were. Again, in the interest of time I am excluding some of the effort of what would be a more complete analysis, but the point of this chapter is primarily to explain how we can manipulate the data in an attempt to join disparate sources; confirm, refine, or adjust our confidence in others’ evaluations; etc.
One improvement over the previous five violins: the products that were clustered tightly in the governance and metadata dimensions are now fully spread out. One problem with these new violins, though: the summarization of the data has forced a loss of the knowledge that Clearstory’s scores in UX and data output vary widely between the two reports, which may mean a reduced confidence level for those two new scores on this product. This is why it is always valuable to keep the underlying data close at hand.
Conclusion
Through a simple data interpolation technique we were able to identify a clear outlier in a set of competitors; and through a simple synthetic union technique we were able to triangulate a better picture of the prep space and improve the levels of confidence that we can attach to single sources of product evaluations. Both of these measures are helpful in product selection processes.
References
Bartley, Paige. Selecting a Self-Service Data Prep Solution. Ovum, 2018.
Borne, Kirk. Data Profiling: Four Steps to Knowing Your Big Data. Inside Analysis, 2014. https://insideanalysis.com/data-profiling-four-steps-to-knowing-your-big-data/
Goetz, Michele. Vendor Landscape: Data Prep Tools. Forrester, February 2016.
Howard, Phillip. Self-Service Data Preparation Spotlight Paper. Bloor, April 2016.
Little, Cinny. The Forrester Wave: Data Prep Tools, Q1 2017. Forrester, March 2017.
Little, Cinny. The Forrester Wave: Data Prep Solutions, Q4 2018. Forrester, 2018.
Randall, Lakshmi. Data Prep Is Not an Afterthought. Gartner, 2014.
Wells, Dave. Data Preparation Buyer’s Guide. Eckerson Group, October 2017.
Appendix A

Appendix B, Results of Analysis on Bloor’s 2016 Report
Before you can blend data you have to find it, so a prep solution needs discovery capability; but the term discovery has its own namespace in data mining, so Bloor describes the data-finding capability as Cataloging. [Forrester uses the term discovery. This term is used a little too informally by many]. Cataloging capabilities include automated crawling, metadata collection, and search.
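As a toy illustration of those three capabilities, here is a hypothetical sketch that crawls a directory of CSV files, collects header metadata, and supports a simple column search [the scope and names are invented; a real catalog persists far richer metadata]:

```python
import csv
import os

def crawl(root):
    """Automated crawling: walk a directory tree and catalog CSV headers."""
    catalog = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(".csv"):
                path = os.path.join(dirpath, name)
                with open(path, newline="") as f:
                    # Metadata collection: record each file's column names.
                    catalog[path] = next(csv.reader(f), [])
    return catalog

def search(catalog, column):
    """Search: find datasets that contain a given column name."""
    return [path for path, cols in catalog.items() if column in cols]
```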
Preparation includes connector functionality, which in a diagram would make Cataloging a small box within the larger Prep box. This is consistent with Gartner’s taxonomy [cite [meatball source]]. Bloor also notes that as of 2016 most products could still be differentiated as one or the other.
Self-service [SS] prep capability is highly desirable. Bloor makes the distinction that cleansing, enrichment, pivoting, etc. are considered “integration / profiling / quality” capabilities in the context of an IT / technical user; but if the solution is designed for end users then these three functions can or should be referred to as SS prep. This is like saying a cooking tool is called a cuillere if it is used by a chef in a restaurant, but called a spoon if used by a layperson at home.
Bloor notes that the market players originated from different backgrounds. Those origins are basically: solutions with a BI origin; solutions with an integration origin; and pure-play solutions. Readers will have to painstakingly analyze the report in order to figure out which are which. Bloor declares that over twenty BI products claim to have prep capability but almost none of these actually offer required prep features, concurring with my previous statement in the second paragraph of this chapter.
Bloor uses a circular, high-score-in-the-center graph to summarize product evaluation results, but divides the circle into three segments based on relationship to average: segment one contains products with high scores, segment two products with medium scores, and segment three the low-scoring products. The method for determining the vector for radial positioning remains a mystery, other than that it is some “combination of innovation and overall score.”
Products are also given a color in the diagram to indicate membership to one of four offering types: blue for prep only; orange for cataloging [pure plays]; red for prep and cataloging [pure plays]; and green for prep with analytics [stand-alone]. Here is a rollup of the first three:
Prep and catalog pure plays: Unifi, Tamr, and TICS. All previously clearly noted as pure plays.
Cataloging pure plays: Alation, and Waterline. Both previously clearly noted as pure plays.
Prep pure plays: Trifacta, and Paxata. Also both previously clearly noted as pure plays.
So far so good, but then it starts to get tricky; the text breaks up the remaining landscape into two new, slightly different category labels, listing them in what must be described as a questionable sequence for the reader to make sense of:
[#4]: Prep Hybrid BI / Standalone, which is totally consistent with the products each coded green in the diagram and labeled as:
Prep with analytics: Freesight, Alteryx, Datameer, Clearstory, Datawatch, and Rocket.
But this [fourth] Prep Hybrid BI / Standalone section in the text also includes SAS, SAP, IBM, and Oracle, even though they were coded blue in the diagram, under the label described simply as ‘Prep.’
And then [unexpectedly] comes a fifth category:
[#5]: Prep Hybrid Data Quality / Data Integration Standalones: Informatica, Talend, and Experian. Each of these three were also coded blue in the diagram, as Prep, which seems to make sense. Bloor notes that SAS, SAP, IBM, and Oracle could also be included under this [fifth] group, which would bring those four back in line with the diagram coding.
So product classification in the Bloor report has, at this point, become a little too opaque, a little contradictory, about 40% fuzzy, and way too confusing to be worth additional effort to untangle. The final paragraph of the report uses the phrases “ones to watch” and “honorable mention,” which also fail to provide clarity.
This might be the most confusing report I have ever read. I suspect that 95% of the audience for this report either paid a third party to analyze it and to explain it in a clear derivative, or threw it in the trash. [My contact # is 703-237-8379, btw]. Frankly, the summary diagram is also a dud. There is little clarity into the node positions, or which product is superior to which.
Alas, we have not gone through all of this in hopes of relying heavily on the original authors, but rather to take what parts of a report might be useful to us and apply them to our own discovery. A summary of the 10 Bloor descriptions might have something that can be used to evaluate positioning in the Prep 4DI [link]. Ugly, as-yet-unrefined notes for that work, not fit for public consumption, are below.
Paxata: good semantics, drive a join+ recommender. The interface is bi-directional, orchestrates both a prep environment and a BI environment. A top tier solution. Confirms 4di.
Trifacta: possibly the number one product for technical users. Top tier. Confirms 4di
Alteryx: visual workflow capability. Results often output to BI environments like Tableau, Microsoft, or Qlik. Top tier. Conflicts with 4DI; only 3.1 on 4di
Datameer: pre built functions for handling major data formats. “Honorable mention.” 4.0 on 4di
Datawatch: source connectors; gov, quality and profiling. Good bet for the future. #3 is 5.0 on 4di
Oracle big prep: stat based ML and NLP support a recommender. “Behind the curve” [due to questions about why cloud-only]. Totally Conflicts w 4di. 4.6 on 4di
SAP Prep: runs on HANA. “Behind the curve.” Confirms 4di
Informatica: app and db connector wizard for technical users. Federated access for business users. ML and semantics drive a recommender. In the mix. Conflicts with 4di. only 2.9 on 4di
Talend: very easy to use interface. Not fully featured. ML and semantics drive a recommender. Good bet for the future. Positioning seems pretty close to confirmed on the 4di. 3.3 on 4di
Microstrategy: the one exception in our group; a BI solution with substantial SS prep. Confirms 4di
6 positions confirmed, 5 positions conflicted.
Scoring estimates for Synthetic Union of 2 sources:
Bloor: the top tier is close to 6 on the 4di. In the mix is next [approx 5.3], Honorable mention seems to be next [approx 4.7], and good bet is fourth [approximately 4].
Recommended Adjustments to 4di: All adjustments = 10%
Alteryx – move up .6, to 3.7. Datameer – slight move up, .44, maybe to 4.4. Datawatch – move back .5, to 4.5. Oracle prep – unclear; maybe a 10% move back, to 4.2. Informatica – move up to 3.3.