Query Requirements Graph

This document presents a decision tree which assists a user in selecting a technology for querying big data, based upon the user’s requirements. The decision tree is presented in the format of a map, which provides the user with the context of related system concepts such as data storage, data preparation, etc.  

Table of Contents

  • Graph Description
  • Diagramming Heuristics
  • Legend
  • Base Map and related Toolkit

Graph Description

This Query Requirement Graph is a technology selection decision tree fitted onto a pre-built base map, which represents a landscape of business decision system concepts.

Diagramming Heuristics

Concepts of the decision tree, represented as the colored cells and the lines between them, must conform to the layout of a base map, represented as the mosaic of small square cells.

As a rule, the tree must use the base map locations for all decision point concepts. 

When the user has a requirement such as ‘query non-relational data,’ which involves two separate map components, the tree must also draw a connection between the two. In this case ‘query’ and ‘non-relational’ combine to complete a more complex concept.

As a consequence of the base map forcing the decision tree to use the map’s concept locations, the tree cannot follow any particular direction, as standard decision paths normally do.

Potentially, map-directed decision tree paths may have to cross over previously charted paths. To reduce visual complexity in this situation, the graph should send users to another sheet and present only the ‘extended’ parts of the tree there.

Legend

Light blue boxes contain questions about user requirements. Darker blue lines are decision pathways. Green boxes are solutions that meet the user’s requirements.

Lavender boxes indicate questions, decisions or solutions that are relevant to tree branches represented on a second sheet.

Base Map and Related Toolkit

The base map is part of a toolkit which provides additional knowledge base support capabilities, including:

  • Information Architecture [IA] and Dimensional Scaling
  • Concept Decomposition
  • Technology Selection

Base Map

The base map is built in Excel, and users can quickly find a concept in their area of interest [provided that area relates to data analysis]. It serves as a proxy and metamodel of a companion database that provides a wealth of information on data analysis technologies.

Most of the concepts in the map contain index codes that reference related information in online libraries and databases such as MeSH, UNESCO, etc.

The map can be easily edited. Users can add new concepts or sub-concepts, or hide information in order to abstract away lower-level details.

Toolkit

The map aligns with companion documents including a taxonomy and a database.

The taxonomy supports IA and dimensional stacking, and benefits users by [1] providing a knowledge organization scheme, and [2] providing a mechanism for map expansion, compression, and navigation between different levels of maps [dimensions].

The database is a knowledge base containing business and technology information on thousands of data analysis technologies. It supports the decomposition of concepts into more detail, and technology selection.

Governance Info

Date of the base map data: 2020

Source of the base map data: dozens of sources have gone into creation of the base map.

Name of the base map, toolkit, and query requirements graph creator: Russell Reinsch

Source of the decision tree: Tomer Shiran. https://www.slideshare.net/dremio/bi-on-big-data-strata-2016-in-london?next_slideshow=8

Data Preparation Software, circa 2018

18-prep-4-violins[pce19.12]

Data mining / exploration and discovery is a mesmerizing part of data analysis. The tedious grunt work of preparing the data for analysis is not such an exciting topic. Given the amount of resources organizations dedicate to data preparation [estimated at half to three quarters of all data processing expenditures], it seems only reasonable to devote an appropriate amount of energy to continued written clarification on the subject.

It makes sense that we frame the discourse of this paper through a definition for the topic itself; all too often, the term “preparation” is conflated with other terms that ultimately increase confusion on the subject. In reality, definitions evolve over time, and what was considered to be part of [x] in earlier years is often described as something different in the next generation.

Current definitions for data prep vary, depending on whom you ask; but the fact that the activity is of utmost importance is not in question. Forrester author Michele Goetz sums it up nicely, declaring “New big data environments, faster data integration, and analytic appliances aren’t the answer [to today’s data analysis challenges]. Your analysts need better tools to speed up data preparation efforts…” [Vendor Landscape: Data Prep Tools, 2016]. One need only look at the volume of venture capital afforded to this space to appreciate the market value. Among the notable private firms with under 10,000 employees operating in the space, a whopping $100 million in VC funding [on average] has been allocated to each vendor since 2016.

First, while searching for a definition, this paper will skim important concepts closely related to preparation, helping to refine the notion of what prep is by also looking at what it is not. Second, we will analyze some weaknesses in commercial reports on preparation software; and in the last chapter, we will manipulate some published data on commercial prep software and create summary visualizations of the analysis.

Table of Contents

Chapter 1: Definitions [Kind of boring]
Chapter 2: Describing analysis of commercial analysts’ reports, via text [Painfully boring]
Chapter 3: Visualizing analysis and synthetic unions of analysts’ reports [Fascinating]

Chapter One, Defining Preparation

So what exactly is data prep? The following examination uses the most credible of published analyst reports and vendor white papers, [which are not all that different] as secondary resources for answering this question. [Ultimately, it seems to boil down to Ovum’s statement on page 5].

In Bloor’s Self-Service Data Preparation Spotlight Paper, author Phillip Howard provides domain-specific jargon for what he believes are key components of the larger data prep concept [Howard, 2016]:

Blending: this is combining datasets.

Munging: this is transformation.

Shaping: this is data preparation.

Wrangling: this is the combination of prep and transformation.

Dave Wells lays out a list of prep functions in Eckerson Group’s Data Prep Buyer’s Guide. Looking at the list of functions as a whole can give us a description of prep which might serve as the equivalent of a definition:

Visual exploration [aka discovery]: the critical first step. Exploration includes a summary of the salient characteristics of a dataset. Pattern discovery is part of visual exploration. Patterns can be basic [statistical distributions of values] or more complex [clusters of attributes based on similarity]. Patterns are a man’s best friend, but that is a concept for another book.

Transformation: for Wells, this function is composed of improvement, enrichment, and formatting. Wells defines improvement as standardizing field values and conforming data to common formats. Enrichment is defined as derivation and appending. Examples are “calculating age based on birthdate” and extending a street address to include a longitudinal geocode. Formatting is defined as sorting records, sequencing fields, masking, and finalizing records for output.
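Wells’ enrichment example [calculating age based on birthdate] is simple enough to sketch. A minimal version, assuming ISO-format date strings in the record; the field names and the reference date are hypothetical, not from the Buyer’s Guide:

```python
from datetime import date

# Enrichment by derivation: append an 'age' field computed from an
# existing 'birthdate' field [ISO yyyy-mm-dd strings assumed].
def enrich_with_age(record, today=date(2018, 6, 1)):
    born = date.fromisoformat(record["birthdate"])
    # Subtract one if the birthday has not yet occurred this year.
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    return {**record, "age": age}

row = enrich_with_age({"name": "Jane", "birthdate": "1980-03-15"})
# row["age"] -> 38 as of the assumed mid-2018 reference date
```

Appending a geocode to a street address would follow the same derive-and-append shape, with a lookup service in place of the arithmetic.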

Cleansing: this function improves data quality. Defective data can be removed or replaced “with derived, default, or most-probable values.”

Blending: the combination of data from multiple sources.

Lineage tracking: related to governance, and confidence and trust.

Metadata and cataloging: catalogs collect metadata, and that metadata aids in many aspects of operations including lineage tracking, governance, and search.

Data modeling: capability of this function extends only to inferring whether a data model is canonical or semantic.

Integration and interoperability: there is a problem of having to load data from access tools to discovery tools to prep tools to analysis tools and finally to reporting tools in order to complete processing [Wells, 2017]. The problem is essentially cost.

Rounding out the list are: sharing [and reuse]; user-defined functions; and scalability and reliability.

Most people are apparently fans of the use of visuals to simplify descriptions of a subject. In trying to picture concepts like data preparation however, diagrams and visualizations almost always fall short. In the case of data preparation, the steps of a workflow do not always progress sequentially. This can be a problem for simplification [and thus definition], as Jane may describe the steps as 1, 2, 3; but John may say the steps are 2, 4, 3, 4, 4, 5.

In Data Prep is not an Afterthought, Gartner author Lakshmi Randall rightly diagrams data preparation as an iterative process, with a workflow that repeatedly loops back to its own earlier ‘steps.’

prep-steps-pce19.2

Randall goes on to say that exploration of the data and identification of the transformations needed are parts of data prep. In the ensuing bullet points in her report, Randall points to profiling, cataloging, statistical computations, metadata, and transformation as key prep techniques [Randall, 2014]. Taking a different view of workflows, Kirk Borne describes prep as step two in data science [Inside Analysis, 2014], where transformations are external to data preparation, placing his view at odds with others on how closely the activity of transformation fits with prep.

Perhaps it would be good to frame the concept inclusively, in a set theory kind of way, as opposed to exclusively. The following Venn-style diagram, created to summarize a data integration text document, presents transformation [in green] as a segment of both data quality functions [encapsulated within the yellow border] and core data integration functions [encapsulated within the blue border]. Note however, that in this diagram transformation has a 2-point relationship with blue, whilst only a 1-point relationship with yellow. We could also say data prep would in fact be very similar to the range of yellow components in the diagram below.

colorwheel-to-scale-pce19.3

Whether x comes before y in a workflow, or whether they should be viewed as separate pieces in a diagram, it might be conceptually more productive to think of these entities simply as process objectives. One objective is collection of the data. Another objective is understanding the condition of the data, which often involves building a catalog for user reference. Another objective is to improve the data, which can involve cleansing, restructuring, normalizing, and integrating data [aka blending]. [Many professionals consider data modeling as categorically part of this objective as well, although there can be significant lag times between the preparation and the modeling].

A fourth objective is the stewardship and governance of the data and workflow. This objective has a close relationship with access control rules and security policies. Another objective is to maintain the knowledge about the data and the analysis of particular workflows, so that other users can reuse the knowledge instead of having to figure out for themselves what someone else already knows about the data or has already done in terms of analysis [this is aka ‘sharing’ of fully developed queries, etc]. The objective of ‘sharing’ takes us back to the Catalog and its twin sister, the metadata repository.

Catalogs and metadata repositories are hellishly expensive to build and maintain unless the functions are automated. The value from catalogs and metadata comes back in the forms of improved retrieval of assets; improved proof of adherence to governance policies; and improved identification of data quality issues.

Chapter 2, Analyzing Commercial Reports

As noted in previous articles, the charts in MQs and Waves are over-generalizations of product capabilities. MQs are less informative than some other reports for product specifics, in comparison to Waves for example, which provide quantitative scoring and taxonomic hierarchies as part of their reports. In any case we want to extract additional value from what is in these and other documents. Starting with 2016, we will work through 2017 and 2018.

2016

A rather detailed analysis of Gartner’s 2016 Market Guide for Self-Service Data Prep has already been completed here. That [4DI] analysis looks at over 20 vendors along eight different capability dimensions; but in reality, the Gartner Guide is a very BI-centric report. Because the BI product field is much larger than the data prep field in terms of vendors, the Gartner Guide arguably includes too many vendors that are not actually focused on data prep workflows.

One of the takeaways from the 4DI analysis of the 2016 guide is that the capability dimensions of transformation; tool integration; use of ML; and metadata and cataloging were most salient. [Diagram of said dimensions in Appendix A].

For the 2016 section of this chapter, we can map another resource to that prior work: Bloor’s 2016 SS Prep and Cataloging report; however, due to various forces, the truly valuable results of analysis on Bloor’s report deviate a bit too much from the main goal of this paper, so that content has been relegated to Appendix B.

2017

Bloor’s 2016 report and Forrester’s 2017 Prep Wave report were released 13 months apart. Some of the text on Datawatch [DWCH] is literally identical, which makes it obvious how much these reports rely on information fed to them by the vendor. Forrester covers only seven products in the 2017 Prep Wave [Little, 2017], all of which they describe as stand-alone[s], not part of a BI tool. The pecking order on the blue Wave diagram is clear: my ranking would be weighted heavily on the “Current Offering” axis [which = product capabilities], resulting in: Paxata, Trifacta, Unifi, Alteryx, SAS, Datawatch, and Oracle, in that order.

The capability dimensions in Forrester’s score breakdown table are: discover and blend, standardize and enrich, transform, deliver, and share. These dimension terms are not used again in any clarifying way in the 2017 Wave document; interested readers have to go back 13 months and locate Vendor Landscape: Data Prep Tools, authored by Goetz, to get a somewhat corresponding picture of the underlying taxonomy. Goetz’s Landscape doc wholly fails to clarify how the Transform dimension relates to the other dimensions; but one good point it does raise is how tool selection must depend on use case.

With the exception of Unifi, scores for Deliver and Sharing look the same. Note that Oracle is excluded from the score-breakdown table because they did not participate. The table below is a summary of the text content of the 2017 Prep Wave.

How Forrester ranked them | Strength | Weakness
Trifacta | ML | Search and collaboration
Paxata | ML, semantic analysis, bidirectional integration with BI tools | Search and connector optimization
Alteryx | Workflow interface | Search, ML
Datawatch | Macros | ML
SAS Data Loader for Hadoop | Connectors, macros | Search, ML, and collaborate features
Unifi | ML, NLP | Search
Oracle Big Prep Cloud | ML driven recommender | n/a

 

2018

Ovum

Ovum provides a report in 2018 that is helpful for refining descriptions, and potentially also for triangulating the evaluation accuracy of other reports. In Selecting a SS Data Prep Solution [Bartley, 2018], Ovum declares: “‘Data prep’ is no longer a clean, discrete market; it is overlapping, messy, and deeply intertwined with the information governance market.”

Ovum assesses 2018 prep products in seven technology dimensions:

  • Integration and exploration [connecting to sources, and initial exploration and profiling of data].
  • Manipulation [core prep: cleansing, blending, transformation, enrichment, and modeling].
  • UX and UI.
  • Data output and analytics [export, and native analysis].
  • Collaboration and ML [ML-powered functionality for user collaboration].
  • Administration [architecture, deployment, and processing].
  • Data governance [metadata and catalog management, and security of data].

Bartley notes early on that a key product capability that must be available in order for a product to be included in the report is cataloging and/or a metadata repository.

Bartley points out that the technology capabilities for governance, and for collaboration and ML, typically set the top products apart; in contrast, product scores in the administration, integration and exploration, and manipulation dimensions are all closely grouped, suggesting “relative market maturity” in these areas.

The current trends in the overall market of most interest [IMHO]: “Differentiators include machine learning guided functionality, connectivity to analytics tools, and governance features.” The report also says “SS data prep, traditionally served as a feeder to the SS analytics ecosystem… [is] increasingly feeding prepped data into ML models.” The current fascination with ML absolutely dictates that we facilitate clarity on what exactly ML does for us in these data prep tools.

Bartley lists the following: ML can power detection of workflow actions; detection of outliers in data; detection and ingest of files including files classified as sensitive; automatic deduplication; and recommendations for joining data and sources. These can all be rolled up into ‘automated suggestion / guidance functionalities.’

Wells lists the following: in cleansing, machine learning and pattern mining are helpful [functions] “to determine most probable values” based on related or surrounding data; ML can detect deficiencies and recommend actions to improve quality; and ML can recommend blending techniques.

ML can actually “increase the degree to which data prep becomes a collaborative effort [because] as each analyst” does their thing, their actions and results become part of a shared KB. This doesn’t sound like ML per se, but if we look at it as part of a system which can make inferences about data and make suggestions to users, then perhaps.  

I think Howard better generalizes ML: prep processing activities are food for ML engines that can make future prep activities easier, perhaps by predicting how you will want to process new data from a new source, potentially even ranking a list of options.

Forrester

Forrester’s 2018 Data Preparation Solutions Wave [Little, 2018] also scores products on seven dimensions, which understandably are not the same dimensions as Ovum’s. However, Forrester’s 2018 dimensions are completely different from those in the 2017 Data Preparation Tools Wave written by the same author, so we are not able to compare dimensions across consecutive years. We are able to at least cluster products more closely with like products for apples-to-apples comparisons at two points in time, which we will now get to in Chapter 3.

Chapter 3, Data Exploration and Interpolation

Looking at the 2018 Prep Wave score table we are given seven product capability dimensions: metadata, ML, collaboration, data activation, UX, data, and governance; although Forrester does not do as good a job as Ovum at describing what is or is not part of a dimension.

As you may know, Waves include a scoring table with details on each dimension [see lower left table in Calc Sheet below]. To break down the report, I performed a 7-dimension differentiation analysis using the same distance-from-average method as in the 2016 Prep 4DI, but not much of anything insightful came of it [lower right circle].
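The distance-from-average method itself is simple arithmetic: center each product’s score on the class mean for that dimension. A minimal sketch [the function name and the toy scores are mine, not the actual Wave data]:

```python
# Distance-from-average: for each capability dimension, subtract the
# mean score across all products, so a positive value means
# above-class-average and a negative value means below.
def distance_from_average(scores):
    """scores: {product: {dimension: score}} -> same shape, centered on dimension means."""
    dims = {d for prod in scores.values() for d in prod}
    means = {d: sum(p[d] for p in scores.values()) / len(scores) for d in dims}
    return {name: {d: round(p[d] - means[d], 2) for d in p}
            for name, p in scores.items()}

# Toy example [made-up numbers, not actual Wave scores]:
wave = {"Trifacta": {"ML": 5.0, "UX": 3.0},
        "Paxata":   {"ML": 4.0, "UX": 5.0},
        "Datameer": {"ML": 3.0, "UX": 1.0}}
centered = distance_from_average(wave)
# centered["Trifacta"]["ML"] -> 1.0 [the ML mean here is 4.0]
```

In the Calc Sheet this is just a subtraction against a column average, but writing it out makes clear why the result highlights differentiation rather than absolute capability.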

18-prep-calcs-pce19.5

Calculation Sheet 1

I then looked at all possible charting options in Microsoft Excel, but none of them arranged the data in any interesting way either. So I tried violin charts, using one violin for each product capability dimension. A violin chart is one of a handful of chart types that can be useful for depicting distributions.

violin-chart-pce19.6

To assign product positions in the visualization, I used distance from average scores from the upper right table of the Calc Sheet. Each violin sits on a graph, which is graduated from top to bottom, with the top third ranging from +2 through zero, the middle third covering zero to negative 2, and the bottom third covering negative 2 to negative 4. All products from the 2018 Wave are then assigned to vertical positions on the graph based on scores from columns T through AC. Left to right positioning in the violin is simply alphabetic ordering.
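The vertical banding just described can be sketched as a small helper; the band names and example scores below are my paraphrase of the layout, not code from the workbook:

```python
# Assign a centered [distance-from-average] score to one of the three
# vertical bands used in the violin layout:
#   top third    covers +2 through zero
#   middle third covers zero to -2
#   bottom third covers -2 to -4
def violin_band(centered_score):
    if centered_score >= 0:
        return "top"
    if centered_score >= -2:
        return "middle"
    return "bottom"

# Left-to-right order within a violin is simply alphabetical by product:
products = {"Trifacta": 0.8, "Datameer": -2.5, "Datawatch": -0.4}
layout = [(name, violin_band(score)) for name, score in sorted(products.items())]
# layout -> [("Datameer", "bottom"), ("Datawatch", "middle"), ("Trifacta", "top")]
```

This is the entire positioning logic: vertical placement carries the information, horizontal placement is only an ordering convention.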

Some products had noticeable patterns. Datameer for example is represented in the chart below as the shaded boxes, across a set of five different capability dimensions. We can easily see that Datameer scores either at the top of the class [above average] or at the bottom of the class [below average], but not in the middle on these dimensions.

datameer-viloins-pce19.7

 

In contrast, the product[s] from DWCH, represented in the chart below as the red boxes, is consistently found at the middle, average in comparison to other products, never appearing at the top or the bottom of these dimensions.

dwch-violins[pce19.8]

So one question pops up: are we seeing two different product development philosophies, where Datameer is still aiming for best of breed in niche functions, and DWCH is a more mature, or well-rounded, product line? Well, the answer to that is an obvious yes, but we wouldn’t need violins to tell us that if we knew much about them beforehand.

I can say that the violin visualizations have been the only visual exploration technique that showed promise for discovering something latent in the 2018 Prep Wave data. If nothing else, these violins expose clusters of products within the larger set.

We can also identify clusters of clusters amidst the set of violins. The ML, Collaboration, and UX violins [bottom row] have multiple similarities. The Activation and Governance violins [upper row] have different similarities. So we end up with two distinct clusters of violins; the Data and Metadata dimension violins fall outside of those two clusters.

Additional observations should rightfully be explained in more detail by showing a series of visualizations, but due to the length of time it takes to construct these visualizations manually, it seems like it would be better for readers if we just skip to the next section. Before moving on though, I would like to point out that I did see indication that Clearstory could be a bit of an outlier in comparison to other products. More on that later.

If we attempt to merge what we can from Ovum 2018 and Forrester 2018, we can see that we do get some high level agreement on what dimensions matter most in data prep.

 

Forrester | Ovum
Metadata | [listed as part of governance]
Governance | Governance [includes metadata]
UX | UX
Machine Learning | Collaboration and Machine Learning
Collaboration | Collaboration and Machine Learning
Data Activation | Data Output and Analytics
Data [meaning?] | n/a
n/a | Integration and Exploration
n/a | Manipulation
n/a | Administration [seems to half belong in the other axis also]

Personally, I think it better to keep collaboration and ML separated as two different dimensions, although the two violins in the previous DWCH diagram are very similar and could in theory be combined. It also seems better to keep metadata and governance separated; so I am left unsatisfied with Ovum’s segmentation of dimensions, and I am also unsatisfied with Forrester’s lack of explication of what their dimension labels stand for. Perhaps we can build some sort of crosswalk over the two, to create an analysis that adds value. First, Ovum’s visualization of the top three products measured on their seven dimensions:

2018-prep-7-dim-top-3-tech[pce19.10]

We can decently estimate the rest of Ovum’s scoring on products not shown here by taking measurements from the spider charts located in each vendor’s section of the report.

To normalize equivalent scores we can simply double the Wave scores, and then average them with the Ovum scores:
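As a sketch of that normalization, assuming Wave scores run on a 0–5 scale and Ovum scores on a 0–10 scale [my reading of the two reports; the function name and example values are hypothetical]:

```python
# Put a Forrester Wave score [0-5 scale] on Ovum's [0-10] scale by
# doubling it, then average the two for a single combined score.
def combined_score(wave_score, ovum_score):
    return (wave_score * 2 + ovum_score) / 2

# e.g. a Wave 3.5 and an Ovum 8.0 combine to (7.0 + 8.0) / 2 = 7.5
print(combined_score(3.5, 8.0))
```

This is a crude crosswalk [it assumes both analysts use their scales linearly and comparably], but it is enough to place products from both reports on one chart.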

18-wave-ovum-avg-pce19.11

While Forrester considers Paxata a true prep solution, Ovum and Bloor do not. Gartner meanwhile has classified both AYX and Paxata as stand-alone prep vendors, but as previously noted, Gartner is rather liberal in their inclusion scheme. Note also that Goetz’s ‘spreadsheet’ workflow vendors are the same ones covered in Ovum’s report, and the other types of workflow vendors including AYX are excluded.

As a result of filtering on the compatible dimensions between the two 2018 reports, we are left with six products and six dimensions [which compress into four], with normalized product scores in each dimension. All six products are basically in the same order in every violin, except for Datameer, which is positioned at 9.0 or higher in two dimensions, and 5.2 or lower in the other two dimensions.

The first letter of each vendor name identifies their respective node in the diagram, with the exception of DWCH, which is represented with a W.

18-prep-4-violins[pce19.12]

Note that these four violins are not built from distance-from-average scores, as the previous five violins were. Again, in the interest of time I am excluding some of the effort in what would be a more complete analysis, but the point of this chapter is primarily to explain how we can manipulate the data in an attempt to join disparate sources; confirm, refine, or adjust our confidence in others’ evaluations; etc.

One improvement over the previous five violins: the products which were clustered tightly in the governance and metadata dimensions are now fully spread out. One problem with these new violins though: the summarization of the data has forced a loss of the knowledge that Clearstory’s scores in UX and data output vary widely between the two reports, which may mean a reduced confidence level for those two new scores on this product. This is why it is always valuable to keep the underlying data closely at hand.

Conclusion

Through a simple data interpolation technique we were able to identify a clear outlier in a set of competitors; and through a simple synthetic union technique we were able to triangulate a better picture of the prep space and improve some levels of confidence that we can attach to single sources of product evaluations. Both of these measures are helpful in product selection processes.

References

Bartley, Paige. Selecting a Self-Service Data Prep Solution. Ovum, 2018.

Borne, Kirk. Data Profiling: Four Steps to Knowing Your Big Data. Inside Analysis, 2014. https://insideanalysis.com/data-profiling-four-steps-to-knowing-your-big-data/

Goetz, Michele. Vendor Landscape: Data Prep Tools. Forrester, February 2016.

Howard, Phillip. Self-Service Data Preparation Spotlight Paper. Bloor, April 2016.

Little, Cinny. The Forrester Wave: Data Prep Tools. Forrester, Q1 [March] 2017.

Little, Cinny. The Forrester Wave: Data Prep Solutions. Forrester, Q4 2018.

Randall, Lakshmi. Data Prep Is Not an Afterthought. Gartner, 2014.

Wells, Dave. Data Preparation Buyer’s Guide. Eckerson Group, October 2017.

Appendix A

2016-prep-5di-notopiclabel

Appendix B, Results of Analysis on Bloor’s 2016 Report

Before you can blend data you have to find it, so a prep solution needs discovery capability; but the term discovery has its own namespace in data mining, so Bloor describes the data-finding capability as Cataloging. [Forrester uses the term discovery. This term is used a little too informally by many]. Cataloging capabilities include automated crawling, metadata collection, and search.

Preparation includes connector functionality, which in a diagram would make Cataloging a small box within the larger Prep box. This is consistent with Gartner’s taxonomy [cite [meatball source]]. Bloor also notes that as of 2016 most products could still be differentiated as one or the other.

Self-service [SS] prep capability is highly desirable. Bloor makes the distinction that cleansing, enrichment, pivoting, etc. are considered “integration / profiling / quality” capabilities in the context of an IT / technical user; but if the solution is designed for end users then these functions can or should be referred to as SS prep. This is like saying a cooking tool is called a cuillère if it is used by a chef in a restaurant, but called a spoon if used by a layperson at home.

Bloor notes that the market players originated from different backgrounds. Those origins are basically: solutions with a BI origin; solutions with an integration origin; and pure-play solutions. Readers will have to painstakingly analyze the report in order to figure out which are which. Bloor declares that over twenty BI products claim to have prep capability but almost none of these actually offer required prep features, concurring with my previous statement in the second paragraph of this chapter.

Bloor uses a circular, high-score-is-in-the-center graph to summarize product evaluation results but divides the circle into three segments based on relationship to average, so segment number one contains products with high scores, segment two has products with medium scores, and segment three are the low scoring products. The method for determining the vector for radial positioning remains a mystery, other than it is some “combination of innovation and overall score.”

Products are also given a color in the diagram to indicate membership to one of four offering types: blue for prep only; orange for cataloging [pure plays]; red for prep and cataloging [pure plays]; and green for prep with analytics [stand-alone]. Here is a rollup of the first three:

Prep and catalog pure plays: Unifi, Tamr, and TICS. All previously clearly noted as pure plays.

Cataloging pure plays: Alation, and Waterline. Both previously clearly noted as pure plays.

Prep pure plays: Trifacta, and Paxata. Also both previously clearly noted as pure plays.

So far so good, but then it starts to get tricky; the text breaks up the remaining landscape into two new, slightly different category labels, listing them in a sequence the reader will find questionable:

[#4]: Prep Hybrid BI / Standalone, which is totally consistent with the products each coded green in the diagram and labeled as:

Prep with analytics: Freesight, Alteryx, Datameer, Clearstory, Datawatch, and Rocket.

But this [fourth] Prep Hybrid BI / Standalone section in the text also includes SAS, SAP, IBM, and Oracle, even though they were coded blue in the diagram and labeled simply ‘Prep.’

And then [unexpectedly] comes a fifth category:

[#5]: Prep Hybrid Data Quality / Data Integration Standalones: Informatica, Talend, and Experian. Each of these three were also coded blue in the diagram, as Prep, which seems to make sense. Bloor notes that SAS, SAP, IBM, and Oracle could also be included under this [fifth] group, which would bring those four back in line with the diagram coding.

So product classification in the Bloor report at this point has become a little too opaque, a little contradictory, about 40% fuzzy, and way too confusing to be worth additional effort to untangle. The final paragraph of the report uses phrases “ones to watch,” and “honorable mention,” which also fail to provide clarity.

This might be the most confusing report I have ever read. I suspect that 95% of the audience for this report either paid a third party to analyze it and to explain it in a clear derivative, or threw it in the trash. [My contact # is 703-237-8379, btw]. Frankly, the summary diagram is also a dud. There is little clarity into the node positions, or which product is superior to which.

Alas, we have not gone through all of this in hopes of relying heavily on the original authors, but rather to take whatever parts of the report might be useful to us and apply them to our own discovery. A summary of the 10 Bloor descriptions might hold something that can be used to evaluate positioning in the Prep 4DI [link]. Ugly, as-yet-unrefined notes for that work, not fit for public consumption, are below.

Paxata: good semantics drive a join+ recommender. The interface is bi-directional, orchestrating both a prep environment and a BI environment. A top tier solution. [Confirms 4DI]

Trifacta: possibly the number one product for technical users. Top tier. [Confirms 4DI]

Alteryx: visual workflow capability. Results are often output to BI environments like Tableau, Microsoft, or Qlik. Top tier. [Conflicts with 4DI: only 3.1]

Datameer: pre-built functions for handling major data formats. “Honorable mention.” [4.0 on 4DI]

Datawatch: source connectors; governance, quality and profiling. Good bet for the future. [#3 is 5.0 on 4DI]

Oracle big prep: stat-based ML and NLP support a recommender. “Behind the curve” [due to questions about why cloud-only]. [Totally conflicts with 4DI: 4.6]

SAP prep: runs on HANA. “Behind the curve.” [Confirms 4DI]

Informatica: app and DB connector wizard for technical users. Federated access for business users. ML and semantics drive a recommender. In the mix. [Conflicts with 4DI: only 2.9]

Talend: very easy to use interface. Not fully featured. ML and semantics drive a recommender. Good bet for the future. [Positioning seems pretty close to confirmed: 3.3 on 4DI]

Microstrategy: the one exception in our group; a BI solution with substantial self-service prep. [Confirms 4DI]

6 positions confirmed, 5 positions conflicted.

Scoring estimates for a synthetic union of the two sources:

Bloor: the top tier is close to 6 on the 4DI. In the mix is next [approx 5.3], honorable mention is third [approx 4.7], and good bet is fourth [approx 4].

Recommended Adjustments to 4di: All adjustments = 10%

Alteryx – move up .6, to 3.7. Datameer – slight move up, .44, maybe to 4.4. Datawatch – move back .5, to 4.5. Oracle prep – unclear; maybe a 10% move back to 4.2. Informatica – move up to 3.3.
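As a sanity check on the arithmetic above, the tier-to-score estimates and the recommended adjustments can be expressed in a few lines of Python. The tier map and the adjust helper are my own scaffolding for illustration, not anything published by Bloor:

```python
# Illustrative only: tier-to-4DI score estimates and the adjustment notes
# from the text above. The mapping and helper function are hypothetical.

TIER_TO_4DI = {
    "top tier": 6.0,
    "in the mix": 5.3,
    "honorable mention": 4.7,
    "good bet": 4.0,
}

def adjust(score, delta):
    """Apply a signed adjustment to a 4DI score, rounded to one decimal."""
    return round(score + delta, 1)

adjusted = {
    "Alteryx":     adjust(3.1,  0.6),   # noted as "move up .6, to 3.7"
    "Datameer":    adjust(4.0,  0.44),  # "slight move up, .44, maybe to 4.4"
    "Datawatch":   adjust(5.0, -0.5),   # "move back .5, to 4.5"
    "Informatica": adjust(2.9,  0.4),   # 2.9 "move up to 3.3"
}
```

One caveat surfaced by the arithmetic: the Alteryx move [+.6 on 3.1] is closer to 19% than 10%, so the “all adjustments = 10%” note should be read loosely.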

 

The Data Preparation Landscape, circa 2016, and a Feature Engineering Heuristic for Product Differentiation

2016-prep-4di-blue

Given data points from the Market Guide for Self-Service Data Preparation published 25 August 2016 by Gartner, which looks like this,


Figure 1. Meatball chart

We are challenged to make efficient use of the data in this visualization. Other than the seven solid meatballs in the BI Tools Integration category, it is tough to figure out which products can truly claim to differentiate from the field along any of the other eight capability dimensions. We can explore constructing a better summary visualization of the data in Figure 1 through simple data interpolation rules. Our primary assumption for this analysis is that product differentiation has value.

Product differentiation: Step 1: Data Exploration:

For the sake of simplifying a new exploratory model design, it makes sense to extract key features that have the potential to be cross-referenced with other data sources, such as analyst reports and validated literature. 25 products and eight of the 10 associated rating criteria were selected from the report shown in Figure 1.

Step 2: Data Transformation: Convert the selected features from the meatball chart into a table in Excel, using integers 0 to 6 to represent the meatballs. Apply color-coding individually to each column.

Steps 3, 4 and 5: Modeling: Calculate average scores for each of the eight dimensions [row 28]; calculate each product’s distance from average [column J]; and calculate the number of top scores [in green] received by each product, tallying them in column K.

2016-prep-8di

Figure 2: Transformation table A

Step 6: Assign each product to the dimension where it received the highest distance from average in column J, logging the assignments in column L [shown in Figure 4].
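Steps 2 through 6 amount to a small computation; here is a minimal Python sketch of the same heuristic, using three made-up products and three dimensions instead of the real 25 x 8 table [the spreadsheet column names survive only as comments]:

```python
# A toy version of Steps 2-6, with hypothetical products and dimensions.
# Meatballs are coded as integers 0-6, as in the text.

scores = {                      # product -> score per dimension
    "ProductA": [6, 2, 4],
    "ProductB": [3, 5, 1],
    "ProductC": [0, 5, 4],
}
dims = ["Integration", "ML", "Cataloging"]   # hypothetical labels

n = len(scores)
# Row 28 analogue: average score per dimension.
averages = [sum(row[i] for row in scores.values()) / n for i in range(len(dims))]

# Column J analogue: distance from average; column L analogue: the dimension
# where each product's distance from average is highest.
assignments = {}
for product, row in scores.items():
    distances = [row[i] - averages[i] for i in range(len(dims))]
    best = max(range(len(dims)), key=lambda i: distances[i])
    assignments[product] = (dims[best], round(distances[best], 2))
```

The column K tally [count of top scores] is omitted here, but follows the same pattern.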

As a result of the positioning heuristics, no products differentiated themselves from the rest of the field in three of the capability dimensions [A, B, and C], so three of the segments on a sketch would have no nodes at all. Those dimensions could be discarded, reducing the interpolation space to five dimensions, which might also make it easier to apply some type of rule for plotting vectors [pulling or pushing nodes toward adjacent segments where a product also scored well or poorly].

It would mean, however, that we would face the problem of optimizing the segment order to get good results from a pulling rule.

Steps 7 and 8: Explicatory Visualization of Modeling: Prepare a sketch surface for a five-dimension interpolation [“5DI”] with a radius of 4 – 0 [see [this post] for background on 5DI sketching]. Position products on the center vector of the appropriate segment, assigning specific positions based on the respective distance values listed in column J.
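For readers who want the plotting rule spelled out, here is one plausible positioning function, assuming five equal 72-degree segments and my reading of the 4 – 0 radius scale [a larger distance-from-average moves the node closer to the center]:

```python
import math

# Hypothetical plotting rule for Steps 7-8: five equal 72-degree segments,
# radius scale from 4 (rim) down to 0 (center); a larger distance-from-average
# [column J] moves the node closer to the center of the chart.

def position(segment_index, distance, n_segments=5, max_radius=4.0):
    """Place a node on the center vector of its segment."""
    angle = math.radians((segment_index + 0.5) * 360 / n_segments)
    radius = max_radius - distance       # distance 4.0 reaches the center
    return (round(radius * math.cos(angle), 3),
            round(radius * math.sin(angle), 3))
```

With this rule, every product in a segment that shares the same column J value lands on the same point, which is exactly the cluster-density problem described next.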

2016-prep-5di-notopiclabel

Figure 3: Explicatory 5DI sketch

A problem surfaces for this analysis due to the simple plotting rule: all products in a segment plot onto the same point [with one exception, Power BI, at the 5:30 clock position], creating super-dense clusters, such as the large circle node shown at the two o’clock position. For better differentiation of individual nodes, the heuristics will need to be improved, or another interpolation technique will need to be used for modeling this data.

Solution 1: One technique for solving the cluster density problem would be to go back to the table in figure 2 and reorder the matrix, creating a pseudo heat map to serve as the final explanatory visualization.

Step 9: Reorder the matrix and rerun the color-coding for the whole matrix.

explanatory-pseudo-heatmap

Figure 4: Pseudo heat map

Initial segments are assigned in column L, which uses distance-from-average scores [for example, subtracting H29 from H3]. Note how the reordered matrix shows the stair-step effect of the heuristics on the assignments in L.

As a result of the heuristics used, products below average in column K are never actually differentiated. Pentaho, for example, is not necessarily differentiated in Cataloging, but it is definitely not differentiated in columns A-D, G, or H; assignment becomes a default action.

Insights from the interpolation gained so far:

The averaging function [row 28] that sets up column J indicates that Integration, ML, and Cataloging are the most differentiated capabilities, in that order. In contrast, basically every product provides Support for Data Sources, Curation and Governance, and Transformation capabilities. Only 16% of the field is below average in Exploration and Profiling, but this is not an area where any products stand out according to the column J rule.

Cataloging and Metadata is an area where the weaker third of the products starts to fall short. A few products score well here, but fall short on the two rarer categories [G and H].

Half of the field offers ML technology to automate the preparation process, and according to the meatball chart you either get ML capability or you do not; there are no fuzzy mid-strength offerings. Nine products offer ML but do not make the cut for the rarest category, Integration. [So, given that the analysis points to ML as the second most hard-to-acquire capability, the reflexive recommendation would be that these nine products should highlight their ML capability first and foremost in their marketing and sales positioning.]

Integration with other BI and analysis tools is the rarest product capability in the field. This is depicted fairly well in the original meatball table, and reinforced multiple times in the results of the heuristic depicted in Figure 4.

Some notes on definitions for the capability categories, from the original Gartner source for this analysis:

Data Exploration and Profiling: includes cataloging data assets [awkwardly, Cataloging is also presented as a separate, top-level category], and discovery and recording of the data lineage of transformations [not to be confused with Transformation, also a separate top-level category].

Metadata Repository and Cataloging: includes cataloging of data sources, transformations, and data lineage [which, as previously noted, are also included in the Data Exploration and Profiling category].

Solution 2: If we really needed a better visualization, potential solutions to the dense cluster problem might be to relax the rule pushing the data points toward the center; to fit the sizes of the segment spaces to the number of products in each category; and then to use some additional information from other sources to modify the column L values that are responsible for the density of the cluster balls in the first place.

This leads to steps 10, 11 and 12: Tally the number of products assigned to each dimension [row 30]; position the boundaries of the five segments on the sketch surface so that they correspond to the number of products per category; and then reduce the impact of the values in column L by expanding the radius scale of the chart from where it was in Figure 3 [0 – 4] to zero to six. [And] Step 13: Get data from other source[s] and join them synthetically to the existing data, in effect, with the intention of modifying each product’s position enough to reduce the density of the product clusters.

Solution 3: Fortunately there is a third option, which ultimately resulted in the desired outcome.

The real problem is breaking up the dense clusters, so the goal at this point is to apply more accuracy to the current positioning through adjustments of the data values.

Step 10: Write in which dimensions each product scored 2nd and 3rd, in columns M and N. Then record L, M and N as a three-digit number in column W, labeled ‘subgroup.’

Step 11: Calculate the percentage of products that score above average in each segment, in row 31, which results in a ranking of the columns by rarity. Keeping H, G, and F as segments 1, 2, and 3, column C is ranked as segment 4, column B is 5, E is 6, A is 7, and D is 8.
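The row 31 calculation can be sketched like so, with a hypothetical 4-product, 4-dimension table standing in for the real one:

```python
# Row 31 analogue: share of products scoring above each dimension's average,
# then a rarity ranking of the dimensions. The 4 x 4 table is hypothetical.

table = {
    "P1": [5, 1, 6, 2],
    "P2": [4, 2, 0, 3],
    "P3": [5, 6, 0, 4],
    "P4": [1, 5, 0, 5],
}
dims = ["A", "B", "C", "D"]

n = len(table)
avgs = [sum(row[i] for row in table.values()) / n for i in range(len(dims))]
share_above = [
    sum(1 for row in table.values() if row[i] > avgs[i]) / n
    for i in range(len(dims))
]
# Rarest capability = smallest share of above-average products.
rarity_order = sorted(dims, key=lambda d: share_above[dims.index(d)])
```

In this toy table only one product stands above average in C, so C ranks as the rarest capability, the analogue of Integration in the real data.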

Step 12: Use M, N and K to calculate a more precise distance from zero score [U] for each product. Follow the order of rarity in row 31 for descending _ . For example,

for subgroup 123, use: J + [G – G29]/2 + [F – F29]/3 + K/6 = U; which results in: [2.9 + 1.3 + .55] x 1.3 = 6.2.

for subgroup 156, the formula would be: [J + [B – B29]/2 + [E – E29]/3] x .84 = 3.2.

A separate BI report was used as input info to separate Trifacta and Paxata distances, which were both 6.2. Trifacta was moved up above Paxata.

Step 13: Find more accurate vectors. Use M and N to calculate position rules in a 360-degree vector, logging the new azimuths in column T. Using ‘pull’ rules does not [always] result in enough separation between nodes for sketch visualizations, so an additional ‘push’ rule was applied here. By looking at the relationship between H and other pertinent segments, it is simple to construct a rule for pushing Informatica and Datameer closer toward the Integration vector, and pushing Cambridge Semantics away: thus H becomes the azimuth push vector for G. That split up the twins at rows 12 and 13 that still had totally identical positions.
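A rough sketch of how pull and push rules of this kind could be encoded; the segment azimuths and weights below are invented for illustration, not the values actually used in column T:

```python
# Hypothetical azimuth rules. The primary dimension (column L) sets the base
# azimuth; the 2nd and 3rd dimensions (columns M, N) "pull" the node toward
# their segments; a designated push dimension shifts it away.

SEGMENT_AZIMUTH = {"H": 30, "G": 90, "F": 150, "C": 210}   # invented values

def azimuth(primary, second, third, pull=0.2):
    """Base azimuth pulled toward the 2nd (full weight) and 3rd (half weight)."""
    base = SEGMENT_AZIMUTH[primary]
    az = base
    az += pull * (SEGMENT_AZIMUTH[second] - base)
    az += (pull / 2) * (SEGMENT_AZIMUTH[third] - base)
    return round(az % 360, 1)

def push(az, push_dim, amount=10):
    """Shift an azimuth away from a designated push dimension's segment."""
    delta = SEGMENT_AZIMUTH[push_dim] - az
    return round((az - amount if delta > 0 else az + amount) % 360, 1)
```

For example, a product assigned primarily to G with secondary strengths in H and F would land between those segments, and the push rule would then separate twins that still share a position.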

Step 14: Feature selection [fs1]:

Attributes from segments that are no longer informative can be deleted from the sketch. The deployment segment E is not especially informative anymore, nor are Sources [A], Curation [B], or Exploration [C], so they can be deleted. This means that Microstrategy and Tableau will be lost, and the remaining four informative segments on the sketch surface can be fit into 180 degrees’ worth of space and clocked for best fit for purpose.

The four remaining segments taken from the 2016 Prep meatball chart map to three segments in another source, Forrester’s 2016 Fabric Wave. In the Fabric Wave, both Transformation and Integration are classified as ‘Orchestration,’ so for the new 180-degree sketch, segments D and H are squeezed together into a single segment with a 60-degree range, giving more room to spread out the cluster in the ML segment. The result is three 60-degree segments: one for Orchestration [0 – 60], one for ML [60 – 120], and one for Cataloging and Metadata [120 – 180]. The Exploration segment is off the screen, below the Metadata segment.

Remember that the goal for this type of analysis is to reach the center, where segments converge. In this 180-degree sketch half of the overall space is hidden, so the ‘center’ is now on one edge of the screen [halfway along the x axis].

2016-prep4di-180

The Valve System

I recently read an old copy of Valve’s employee handbook, which contains a fascinating philosophy on organizational productivity. The following is a summary of the key points.

Employees are self-directed. Each one figures out what the company’s customers want, and gives it to them.

Employees can work on any project of their choice, including their own, and they can recruit others to work on it. Teams form organically. This means employees must constantly evaluate which are the most valuable things to work on.

It also means employees need to leverage their individual strengths, taking care not to find themselves putting out fires all the time; in other words, remaining proactive about the company’s long-term goals and choosing the most important work.

It is best to take actions that can be measured, where outcomes can be predicted, and results can be analyzed.

Employee performance is evaluated by their peers, and against their peers, with the latter referred to as stack ranking. Peer reviews provide feedback to the employee. Compensation is set and adjusted through the stack ranking technique, where project groups rank their own members based on:

—  Difficulty of problems solved, and uniqueness of capability within the company;

—  Productivity [Working repeated long hours over extended periods is seen as a sign of poor planning. Crunch mode is not really approved of];

—  Level of contribution to the group and product, including accuracy of decisions, or effectiveness as a tester. Mistakes are not penalized, but repeating the same mistake over and over is penalized.

Hiring is the most important activity in the company. Only hire the best, ideally T-shaped people, with balanced emphasis on the horizontal: capabilities in high-bandwidth collaboration, for example.

Part of my horizontal, in case you were wondering, is project information stewardship: knowledge of the mechanics of projects, awareness of the code-writing part of making software, and the locations of resources, including legal, financial, and psychological.


Personnel is the most important component of a retailing mix

aiim-2016

Personnel is the most important component of a retailing mix. More important than the product, price, place, promotion, or presentation.

Maybe if the ‘people’ part of the marketing mix were visibly and consistently effective, it would be more obvious to the world that personnel is the most significant element of any B2C experience. Any business should be able to tell you that its human resource is its most valuable asset, so how could ‘people’ not be the most important component of the retailing mix?

Why is the human resource the most valuable asset? Only through the salesperson, account manager, or support rep can the company establish the ‘goodwill’ with the customer that brings the customer back. Since retaining an existing customer is 6-10 times more cost effective than generating a new one, the chain reaction can be seen all the way from the first encounter with the customer down to the bottom line of the income statement.

Keeping personnel trained to be consistently effective is management’s job. The marketing department should integrate with the human resources department when developing sales training programs. “An effective IMC [integrated marketing communications] plan consists of building bridges with other internal departments so that everyone is aware of the thrust and theme of the program. Satisfied and positive employees are more likely to help the firm promote its image.” That statement could be augmented to say that the positive employee will definitely promote the firm’s image.

A 2017 article in Cox Business news concurs: http://www.coxblue.com/new-retail-paradigm-its-a-people-not-a-product-business-now/

References:

Ebren, Dr. Fügen. Impact of Integrated Marketing Communications Programs in Enhancing Manager and Employee Performance. PDFCAST.org. Retrieved October 14, 2010. pdfcast.org/pdf/impact-of-integrated-marketing-communications-programs-in-enhancing-manager-and-employee-performance

Marketing Teacher. People and Services Marketing. Retrieved October 14, 2010. www.marketingteacher.com/lesson-store/lesson-people.html

Wikipedia. Goodwill (accounting). Retrieved October 14, 2010. http://en.wikipedia.org/wiki/Goodwill_(accounting)

SQL Layers

Stinger, Panthera, Impala, Drill, Pivotal


Hortonworks is 100% open source, with deep integration: analytical DBMS, in-memory, 3 for Hadoop, and streaming; all much like Cloudera, except that Hortonworks offers a free desktop Hadoop sandbox. Release schedules are slower, with products released only when fully vetted by the community. The conservative approach means fewer bugs and lower risk, but it does have a couple of drawbacks …

Strategic partnerships with Teradata, SAP, Red Hat, Rackspace and Microsoft. Customers include AT&T, Bloomberg, and Cardinal Health.

MapR holds third place in market share behind Cloudera and Hortonworks, adding speed and reliability to Hadoop. It is not totally open source, but its balance of proprietary and OS components actually provides some advantages, in the form of ready-made capabilities that are somewhat lacking in Hortonworks and Cloudera. These include an optimized metadata management feature with strong distributed performance and protection from single points of failure; full support for random write processing; and a stable, node-based job management system (MapR 2014).

MapR offers three distributions and a long list of integrations for big data applications including Hive, Stinger, Tez, Drill, Impala and Shark for SQL access; and Pig, Oozie, Storm, Zookeeper, Sqoop, Whirr, Spark, Flume and Mahout for just about any other capability users require.

EMC Pivotal is a product line resulting from the combination of VMware Cetas, Cloud Foundry, Gemfire, EMC Greenplum and Pivotal Labs, plus a $100 million investment from General Electric. Pivotal’s analysis technique moves the processing to the database, similar to SAS and Alteryx.

Greenplum is an analytic database that is part of the EMC Pivotal product line. Pivotal HD Community Edition provides integration of HBase, HDFS, Hive, MapReduce, SAS and Zookeeper. Two other Pivotal products, Gemfire and Hawq (an acronym for Hadoop With Query), combine with HD to form an in-memory analysis Hadoop distribution.

3 Clarifications on Yarn, Hadoop 2.0, HDFS

Distributed File Storage: in the well-known distributed file storage systems, multi-structured (object) datasets are divided into blocks and spread over a cluster or clusters of servers for parallel, batch, sequential, write-once operations. Any type of data and many file sizes can be handled without formal extract, transform and load conversions, with some technologies performing markedly better for large file sizes.

Distributed Computing: the popular framework for distributed computing consists of a storage layer and processing layer combination that implements a map-and-reduce style programming model. Low-cost servers support the distributed file system that stores the data, dramatically lowering the cost of computing on data at the scale involved in web indexing, for example. MapReduce is the default processing component. Processing results are typically then loaded into an analysis environment.
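To make the programming model concrete, here is the classic word count expressed as single-process Python stand-ins for the map, shuffle, and reduce phases [no cluster involved; on a real system each phase runs distributed across nodes]:

```python
from collections import defaultdict

# Single-process stand-ins for the three phases of the model: map emits
# key-value pairs, shuffle groups them by key, reduce aggregates each group.

def map_phase(records):
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["Big data big storage", "data lake"])))
```

The batch, full-pass nature of this flow is exactly why iterative or low-latency workloads strain basic MapReduce, as discussed below.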

The use of inexpensive servers is appropriate for slower, batch-speed big data applications, but does not provide good performance for applications requiring low-latency processing. The use of basic MapReduce also places limitations on updating or iterative access to the data during computation; BSP systems or newer MapReduce developments can be used when repeated updating is a requirement. Improvements and “generalizations” of MapReduce have been developed that provide functions lacking in the older technology, including fault tolerance, iteration flexibility, elimination of the middle layer, and ease of query.

Resource Negotiation: the common distributed computing system has little in the way of data management capabilities built in. Several technologies have been developed to provide the necessary support functions, including operations management, workflow integration, security, and governance; but of special importance to the development of resource management, are new features for supporting additional processing models other than MapReduce and controls for multi-tenant environments and higher availability and lower latency applications.

In a typical implementation, the resource manager is the hub for several node managers. The client, or user, accesses the resource manager, which in turn launches a request to an application master within one or more node managers. A second client may also launch requests, which will be given to other application masters within the same or other node managers. Tasks are assigned a priority value and allocated based on available CPU and memory, and the nodes provide the processing resource.
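The negotiation flow can be caricatured as a toy scheduler: requests carry a priority plus CPU and memory needs, and grants are made against per-node budgets. This is an illustration of the concept only, not the actual YARN scheduler:

```python
# Toy model of priority-based resource negotiation: higher-priority requests
# are granted first, each against the first node with enough free CPU/memory.

def allocate(requests, nodes):
    """requests: list of (priority, cpu, mem, app); nodes: name -> [cpu, mem].
    Returns app -> node for each granted request."""
    granted = {}
    for priority, cpu, mem, app in sorted(requests, reverse=True):
        for name, free in nodes.items():
            if free[0] >= cpu and free[1] >= mem:
                free[0] -= cpu      # reserve the node's CPU and memory
                free[1] -= mem
                granted[app] = name
                break
    return granted

grants = allocate(
    [(1, 2, 4, "app-a"), (3, 4, 8, "app-b"), (2, 2, 2, "app-c")],
    {"node1": [4, 8], "node2": [4, 8]},
)
```

Here the priority-3 request fills node1 entirely, so the remaining requests spill over to node2, mirroring how the resource manager spreads application masters across node managers.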

Data movement is normally handled by transfer and API technologies other than the resource manager. In rare cases, peer to peer (P2P) communications protocols can also propagate or migrate files across networks at scale, meaning that technically these P2P networks are also distributed file systems. The largest social networks, arguably some of the most dominant users of big data, move binary blobs of over 1GB in size internally over large numbers of computers via such technologies. The internal use case has been extended to private file synchronization; where the technology permits automatic updates to local folders whenever two end users are linked through the system.

In external use cases, each end of the P2P system contributes bandwidth to the data movement, currently making this the fastest way to deliver documents to the largest number of concurrent users. NASA, for example, uses this technology to make 3GB images available to the public; any large chunk of data, such as video or scientific data, can be quickly distributed at lower bandwidth cost.

35 Important eDiscovery Vendors

eDiscovery

eDiscovery Software Vendors | Use Case | Pricing / Business Model | Overall Rating | Data Coverage
AccessData ECA, EDA processing; and attorney review for LE. na 500 employees. High execution. na
Alphalit ECA, EDA for LE. na na na
Catalyst All right hand EDRM functions, relies on other vendors for left hand. na Large, 170 employees. Predictive coding in Asian language.
Clearwell Systems (Acquired by Symantec for $400M, 2011) All phases of EDRM, offered as an appliance. On almost all shortlists and is identified by many of its competitors as among their top three threats. All inclusive pricing. available through a network of partners and via legal services providers as hosted SaaS. Chosen by many for its ease of use and product features. na
CommVault Publicly traded backup vendor. Differentiates in deduplication and governance. na na
Daegis: DAEG Edge product is designed to be an SS cloud offering. Includes remote ingest to load and process custodian data, export of productions. na ECA, EDA for LE. Acumen tech assisted review feature is fully embedded in Edge Review workflow.
Datassimilate Powersearch Specializes in lower cost ECA/EDA. na na na
Digital Warroom Small scale, low cost. $895 to $11,895 to start up. na na
Driven Right hand EDRM. Competitive No differentiation in a crowded market. na
Everlaw (previously EasyESI) Doc review. Priced per Gigabyte. na Seeded with $850k. Based out of Berkeley, CA.
eBrevia Core competencies in system architecture and design. na na na
EMC Kazeon eDiscovery Large capacity appliance. na 500 customers. ECA, EDA, ID, collection and processing for LE.
Epiq In addition to offering its own proprietary tech, Epiq offers third party software to perform application and data hosting for clients that use other products. Epiq provides hosted, on-site and managed services. na Expanded coverage that suits law firms particularly well, as they tend to use different products for different cases, or are asked by their clients to use particular products.
Exterro ECA, EDA for LE. The company has expanded its offerings to include more EDRM functions, now extending from identification to production. Fusion software suite is the only product built on a general-purpose workflow engine and integration hub. na Can handle the e-discovery process across departmental and complex org structures, and serve as a platform for integrating IT systems like archiving, content mgmt, enterprise legal matters, HR, and cloud based storage; as well as other eDisco applications.
FTI Tech Right hand ECA, EDA for LE advanced users. Saas and on premise. Both on per user. Global; large and complex. Biggest threat to FTIs is their clients’ desire to move the upfront part of eDiscovery process in house.
Guidance Software: GUID ECA, EDA processing for LE. $3,000 plus maintenance, Five star. Acquired CaseCentral, now called EnCase eDiscovery Review.
HP Autonomy Hybrid search vendor infrastructure software. Streamlining to high margin opportunities in e-disco (/legal content mgmt), compliance, and info governance. Cloud, on-premise, hybrid, and appliances. Language independent, conceptual tech. IDOL uses complex pattern-matching algorithms and probabilistic modeling to form a conceptual and contextual understanding of content.
IBM OmniFind, Stored IQ Powerful but complex. Long history of use in law enforcement and healthcare. TCO can be high. StoredIQ for LE ECA/EDA.
Integron Privately owned niche player with small ability to execute. na na Full spectrum.
IPRO Tech ECA, EDA processing. na na na
kCura Relativity Dual strategy; hosted (service) or on premise (software). Assisted review (trained on relevance by sampled subset coding). 100 direct license law firms, 19 corporations, 16 gov agencies. Often offered by other vendors in addition to their own software. Competitive. 3 year terms. Software is on per user basis, extra charge for analytics, or processing. Training is $500 / day. Widely used; grown from 14K users in 2009 to 89K in 2014. From 57 employees to 347. Wide.
KPMG Staff is knowledgeable in technology assisted review. Narrow client base. na na May not keep up in a fast moving market.
Kroll Ontrack (Acquired by Altegrity, 2010) Appears on many shortlists. Like FTI, KO is a provider that customers turn to for both software and services. Large corp. customer base. Attempting to make pricing and provisioning strategies more transparent. na Made necessary business model changes toward a strong vision of where it wants to go in the market.
LexisNexis Research, litigation, risk solutions. Product line includes Concordance, Law PreDiscovery, Lexis.com, Advance, 360, Dossier, Subscription for 1 to 10 attorneys, $25 / month. 11 to 20 attorneys, $125. Search charges can be as high as $324. Access is $50 or less. Alerts are $16 / day and up. Docs are 20 to $80 per link. Reports are 12 to $75 each. Images can be $1500. Attachments can be $225. D&B reports up to $629. Dossiers $150. Assurance reports range from $1 to $1,000. environmental gateway reports 99 to $230. Lack of functionality and differentiated product strategy. Wide. Relavint does link visualization. LPD does ECA, EDA processing.
Mindseye Closest competitor to Vound in ECA, EDA processing. Lower Smaller, younger company. na
Nuix ECA, EDA processing. Based in Australia. $3,500 plus 25% annual maintenance. na Enables early assessment of data for any given matter.
Orange Legal Specializes in lower cost ECA/EDA. na na na
Planet Data ECA, EDA for LE. na na na
Recommind Axelerate 5 Full-spectrum EDRM vendor; also has an enterprise search product. Key option for predictive coding. All inclusive pricing. Expensive. Web technology; zero installation. 103 docs reviewed per hr in live tests. 90% average reduction of post culled review set (w predictive coding). Federated search (I, E), categorization, processing, culling, redaction, analysis, review.
Stroz Friedberg A battleship: extraction, predictive coding, language support, visual mapping, culling. The majority of SF’s clients are law firms. na Few weaknesses. Features the market expects, but not visionary. Partners with Pangea3 to provide doc review services.
Thomson Westlaw Online legal research; not eDiscovery. $300 / person. Multistep searching. Case law, statutes, codes, publications, records, law journals, law reviews. 40K DBs.
Ubic Lit I View Asian centric. na Handles audio and video files in any language.
Vound Intella DIY forensic search. Now expanding to larger businesses and web based offering (Connect). Connect is not loaded onto customer’s machine; data is kept on clients’ own server. Licensed for $25K. Some attorney review features. $895 (very low) searches 10GB ESI; 100GB is $2.7K, 250GB is $4K. Standalone version is $6K. Multi user is $11K. All one yr; additional yrs are 20% of purchase price. Easy to use. ECA/EDA incl. Advanced clustering visualization. V1.8 to have predictive capability.
Xerox Litigation Services Entered the segment in 2006 while acquiring about seven eDiscovery related companies,  the latest, Lateral Data, 2012. na na ECA, EDA for LE.
ZyLAB Serious functionality. Full spectrum including audio. First eDiscovery provider with visualization classification. na Predictive coding based on semantic capability. Recognizes pictures.