Prioritizing and annotating the dimensions of input data is part of the optional guidance step for each analysis, as introduced in Working with Inspirient > Providing Guidance. For both dimension priorities and annotations, suggested values are provided by the system based on its current understanding of the dataset. In most cases, users only need to review and possibly tweak these suggestions. If an analysis is started via the I’m feeling lucky button, all suggestions are used directly without reconfirming them with the user.

Prioritization of Input Dimensions

Prioritization affects the sort order of results: results derived from higher-priority dimensions are displayed more prominently and are more likely to be included in stories.

The exact effects of low and high priorities are as follows; the objective is to ensure that overall comprehensiveness across all analytical methods degrades gracefully for very large datasets:

  • Dimensions set to the lowest priority setting are ignored during the analysis
  • Dimensions set to the highest priority setting are analyzed preferentially, i.e., they are guaranteed to be evaluated by all applicable analytical methods
  • Other low-priority dimensions may be omitted from certain analytical methods if computational requirements would otherwise exceed allocated resources

Client-specific and hardware-specific options may be set by system administrators to fine-tune system behavior.

Analysis Guidance – Dimension priority and context dialogs

Contextualization of Input Dimensions

Annotations allow users to establish the analytical context of input dimensions, for example, to specify whether calculating the sum over a column of numeric values is sensible (e.g., for inventory quantities) or not (e.g., for time-series measurements).

The effects of annotations are additive if multiple annotations are applied to the same column of a dataset. If the effects of multiple annotations are mutually exclusive, the annotation applied last overrides any preceding annotations.

There are four kinds of data annotations:

  • Filter – Annotations that perform various filters on the input data
  • Transformation – Annotations used to carry out a transformation on the input values
  • Semantic – Annotations to explicitly communicate column meaning
  • Analysis – Annotations that affect the analysis

Some annotations also have a general negated version that causes the stated effect of the annotation not to be applied or considered when evaluating the dataset. For the negated version, the prefix NOT_ (including the trailing underscore character) is prepended to the regular annotation. For example, the annotation SUMMABLE becomes NOT_SUMMABLE in its negated form. Annotations that support this form of general negation are marked with (N) in their type fields below.

The supported annotations and their effects are listed in the following tables.

Annotations for Basic Data Properties
Annotation Type Description Examples
DEFAULT_VALUE Transformation Specifies a default value for this column that is used whenever a value is not given, i.e., it is absent or given as ‘null’.
  • DEFAULT_VALUE("No answer given") applied to a column of opinion polling answers inserts the phrase “No answer given” into all empty cells, to be used in any output of results
IGNORE_VALUE Transformation Ignore specified value(s), i.e., treat as absent or null
  • IGNORE_VALUE("John Doe") applied to a column of names causes all cells that contain the string “John Doe” to be treated as empty
MORE_IS_BETTER Semantic Specifies that higher values in a numeric column are ‘better’ in the context of this analysis, i.e., for any optimization sets the aim to achieve higher values for this dimension (as opposed to lower values). See also LESS_IS_BETTER.
  • MORE_IS_BETTER applied to a column of prices in the context of selling goods ensures that higher prices are preferred
  • MORE_IS_BETTER applied to Likert-scale survey responses ensures that the higher Likert-scale values are aggregated into the Top 2 / Top 3 bucket (and the lower values into the Bottom 2 / Bottom 3 bucket). This is also the default behavior
LESS_IS_BETTER Semantic Specifies that lower values in a numeric column are ‘better’ in the context of this analysis, i.e., for any optimization sets the aim to achieve lower values for this dimension (as opposed to higher values). See also MORE_IS_BETTER.
  • LESS_IS_BETTER applied to a column of prices in the context of buying goods ensures that lower prices are preferred
  • LESS_IS_BETTER applied to Likert-scale survey responses ensures that the lower Likert-scale values are aggregated into the Top 2 / Top 3 bucket (and the higher values into the Bottom 2 / Bottom 3 bucket)
HAS_SUBTOTALS Semantic Specifies that a numeric column contains subtotals, which implies that certain analyses are not permissible for this column, e.g., calculating sums and averages over all values (or even selections of values that still include the subtotals).
  • HAS_SUBTOTALS applied to a revenue column on a business report that also aggregates revenues by business unit or geographies ensures that no sums/averages are incorrectly calculated that include these subtotals
ID Semantic Specifies that numeric values in a column represent ID values that can be used for aggregating other numeric values or joining data from another table, but should not be summed or averaged.
  • ID applied to a column that contains five-digit postcodes (as used in Germany) ensures that the analysis results may contain aggregations by postcode (as opposed to averaged or summed postcodes)
NATURAL_LANGUAGE_TEXT Semantic Text values in this column should be treated as natural language text for semantic text analysis.
  • NATURAL_LANGUAGE_TEXT applied to a column of open answers in a survey table ensures that these answers are processed by semantic text analysis
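The Top 2 / Bottom 2 bucketing controlled by MORE_IS_BETTER and LESS_IS_BETTER can be sketched as follows. This is a minimal illustration assuming a 5-point Likert scale; the function name and scale handling are our assumptions, not Inspirient's implementation:

```python
from collections import Counter

def top2_share(responses, scale_max=5, more_is_better=True):
    """Share of responses falling into the Top 2 bucket of a Likert scale.

    MORE_IS_BETTER (the default): the two highest scale values form the
    Top 2 bucket. LESS_IS_BETTER: the two lowest values do instead.
    """
    counts = Counter(responses)
    total = sum(counts.values())
    top_values = {scale_max, scale_max - 1} if more_is_better else {1, 2}
    return sum(n for v, n in counts.items() if v in top_values) / total
```

For example, `top2_share([5, 4, 3, 2, 1, 5])` counts the 4s and 5s, while passing `more_is_better=False` counts the 1s and 2s instead.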
Annotations for Basic Statistical Properties
Annotation Type Description Examples
NOMINAL Semantic (N) Ensures numeric values in this column are treated as (unordered) categorical items, i.e., they will only be counted when aggregated and not summed or averaged. See also ORDINAL.
  • NOMINAL applied to a column of numerical product IDs in a table of sales transactions ensures that other numerical columns (e.g., sales revenues) are aggregated by product ID for analysis
  • NOT_NOMINAL applied to a column of given names ensures that no aggregations are calculated using these given names, even if they would otherwise satisfy all prerequisites
CATEGORICAL Semantic Synonymous with NOMINAL. (as above)
ORDINAL Semantic (N) Ensures values in this column are treated as ordered categorical items, i.e., allowing them to be counted in aggregations and sorted in the correct order. See also NOMINAL.
  • ORDINAL applied to a column of numerical priorities in a table of tasks ensures that other numerical columns (e.g., estimated cost) are aggregated and sorted by priority level for analysis
  • NOT_ORDINAL applied to a column of price points in a table of commercial items ensures that other numerical columns (e.g., estimated cost) are not aggregated and sorted by these prices for analysis
SUMMABLE Semantic (N) Specifies that it is OK to sum the numerical values in this column.
  • SUMMABLE applied to a column of prices in a table of sales transactions ensures that the prices are summed to get total revenue numbers
  • NOT_SUMMABLE applied to a column of prices in a table of inventory / SKU items ensures that the prices are not summed
MAXIMIZABLE Semantic Specifies that the values in a numeric column have a meaningful maximum which should be considered in all analyses. See also MINIMIZABLE.
  • MAXIMIZABLE applied to a column of readings of a temperature sensor ensures that the maximum temperature is calculated and considered (whereas this would not be desirable for a sensor that captures geographical longitude)
MINIMIZABLE Semantic Specifies that the values in a numeric column have a meaningful minimum which should be considered in all analyses. See also MAXIMIZABLE.
  • MINIMIZABLE applied to a column of readings of a temperature sensor ensures that the minimum temperature is calculated and considered (whereas this would not be desirable for a sensor that captures geographical longitude)
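How annotations such as SUMMABLE, NOMINAL, MAXIMIZABLE and MINIMIZABLE might translate into permissible aggregations can be illustrated with a hypothetical sketch. The function and its rules are our illustration only; Inspirient's actual decision logic is not documented here:

```python
def permitted_aggregations(annotations):
    """Hypothetical mapping from statistical-property annotations to the
    aggregation functions that remain permissible for a numeric column."""
    allowed = {"count"}                 # counting is always safe
    if "SUMMABLE" in annotations:
        allowed.add("sum")
    if "NOT_SUMMABLE" in annotations:   # negated form overrides
        allowed.discard("sum")
    if "MAXIMIZABLE" in annotations:
        allowed.add("max")
    if "MINIMIZABLE" in annotations:
        allowed.add("min")
    if {"NOMINAL", "CATEGORICAL", "ORDINAL"} & set(annotations):
        allowed.discard("sum")          # categorical values are only counted
    return allowed
```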
Annotations for Advanced Statistical Properties
Annotation Type Description Examples
DEPENDENT_VARIABLE Analysis Specifies that this column is a dependent variable of interest that should be explained given information in other independent variables. See also INDEPENDENT_VARIABLE.
  • DEPENDENT_VARIABLE applied to a column of sales revenues results in explanations of sales revenue values in light of changes in other columns, e.g., unit prices
INDEPENDENT_VARIABLE Analysis Specifies that this column is a control variable that can be used to explain effects on the dependent variable(s) in the same table. See also DEPENDENT_VARIABLE.
  • INDEPENDENT_VARIABLE applied to a column of sales revenues results in explanations of other columns, e.g., profit, to be calculated in light of the given sales numbers
REFERENCE_CATEGORY Semantic For multivariate regression analysis, this annotation specifies the category value (as the parameter of this annotation) that is used as the baseline reference to compare against all other categorical values of a column.
  • REFERENCE_CATEGORY("New York") applied to a column of cities makes New York the reference category that all other cities are compared against in multivariate regression
WEIGHTING_FACTOR Analysis In survey analytics, this annotation specifies that the values in this dimension should be used to weight aggregations, typically with the objective of compensating for bias in the collected data.
  • WEIGHTING_FACTOR applied to a column of numeric data causes all aggregations to be weighted according to the respective value in this column
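The effect of WEIGHTING_FACTOR on an aggregation corresponds to standard weighted averaging, which can be sketched as follows (a generic illustration, not Inspirient's implementation):

```python
def weighted_mean(values, weights):
    """Weighted average, as used when a WEIGHTING_FACTOR column is present.

    Each response's contribution is scaled by its weight, compensating,
    e.g., for over- or under-represented demographic groups.
    """
    assert len(values) == len(weights)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```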
Annotations for Data Processing
Annotation Type Description Examples
ABC_CLASSIFICATION Transformation Classify column into n categories using ABC analysis.
FILTER_ON Filter Filter the table so that only rows matching given criteria are processed. Criteria may be specified as values for direct comparison, as a regular expression for text, or with comparison operators for numbers. Filters applied to multiple columns are additive, just like drop-down filters, e.g., in Microsoft Excel. Active filters are noted in the Analysis Context of each result they affect.
  • FILTER_ON("Germany") applied to a column of country names focuses the analysis of the containing table only on Germany-related rows
  • FILTER_ON(>=500) applied to a column of numeric values, such as “Total sales”, focuses the analysis of the containing table on rows with a sales value greater than or equal to 500. Applicable operators are: greater than ‘>’, greater than or equal to ‘>=’, less than ‘<’, less than or equal to ‘<=’, and equals ‘==’
  • FILTER_ON("Germany|United Kingdom") focuses the analysis of the containing table only on rows related to either Germany OR United Kingdom (supporting Perl Compatible Regular Expressions)
FILTER_ON_DOMINANT_DOMAIN Filter Filter table on the most frequent items within a column, i.e., those categorical items that in combination account for at least 80% of the contents of the column. This annotation is used to focus the analysis on the most relevant categories, and omit noise created by a “long tail”.
  • FILTER_ON_DOMINANT_DOMAIN applied to a column of transaction types results in only the rows of main transaction types (like ‘deposit’ and ‘withdrawal’) being considered in the analysis, while rows with other, less frequent transaction types (e.g., rebookings) are ignored
FILTER_ON_DOMINANT_DOMAIN_BY_VALUE Filter Filter the table on the smallest set of items of a numerical column that together amount to at least 80% of the column's total accumulated value. Not defined if the column contains both positive and negative values.
  • FILTER_ON_DOMINANT_DOMAIN_BY_VALUE("Revenue") applied to the revenue column in a table of sales transactions focuses the analysis on only those transactions that individually contributed significantly to overall revenue
FILTER_ON_TOP_3_BY_VALUE Filter Filter table on the three items with the largest sum of a given numeric value column.
  • FILTER_ON_TOP_3_BY_VALUE("Price") applied to a price column in a table of sales transactions focuses the analysis on the three highest priced items
FILTER_ON_TOP_10_BY_VALUE Filter Filter table on the ten items with the largest sum of a given value column.
  • FILTER_ON_TOP_10_BY_VALUE("Price") applied to a price column in a table of sales transactions focuses the analysis on the ten highest priced items
DRILL_DOWN Transformation Split table into separate tables for each unique value in this column. Each of these tables is then analyzed separately. The drill-down value is specified in the Analysis Context of each result. Use with caution, as this annotation will very likely lead to a much larger number of results. See also DRILL_DOWN_ON_DOMINANT_DOMAIN.
  • DRILL_DOWN applied to a column of country names results in separate analyses for each country. Each result is clearly labeled with regard to which country it covers
DRILL_DOWN_ON_DOMINANT_DOMAIN Transformation Split table into separate tables for the most frequent items in this column, i.e., those items that collectively account for 80% of all data in this column. Each of these tables is then analyzed separately. The drill-down value is specified in the Analysis Context of each result. Use with caution, as this annotation will likely lead to an increase in the number of results. See also DRILL_DOWN.
  • DRILL_DOWN_ON_DOMINANT_DOMAIN applied to a column of country names results in separate analyses for each of the most frequent countries. Each result is clearly labeled with regard to which country it covers
JOINABLE_ID_VALUES Transformation Join this table and a second table that is also being submitted for analysis using the values in this column as primary / foreign keys. The column with unique values is treated as primary key. If both columns have non-unique, i.e., repeating, values, no action is taken.
  • JOINABLE_ID_VALUES applied to a column of client IDs in a table of sales transactions (foreign key) and to another column of client IDs in a client directory (primary key) joins the data from the client directory to the table of sales transactions. This combined table is then analyzed
JOINABLE Transformation Synonymous with JOINABLE_ID_VALUES. (as above)
USE_AS_IS Transformation Disables any automated transformations during analysis for this column, e.g., any automated attempts of cleaning up erratic and/or outlier values or of converting the data in this column to a more suitable data type for analysis.
  • USE_AS_IS applied to a column of predominantly numeric values with very few interspersed “TBD” values will leave these “TBD” values in place. As a result, no statistical analyses like sums or averages will be performed on this column (which would otherwise be the default behavior)
  • USE_AS_IS applied to a column of regular calendar dates with a few spurious outlier dates, e.g., 01/01/1700, will keep these outlier dates and use them as part of any time series analysis (default behavior is to exclude outlier dates from time series analysis)
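Two of the filter annotations above follow concrete, reconstructible rules: FILTER_ON distinguishes comparison criteria from regular expressions, and FILTER_ON_DOMINANT_DOMAIN keeps the smallest set of categories covering at least 80% of rows. A minimal Python sketch of both (the function names, and the assumption that text criteria must match the full cell value, are ours, not the product's API):

```python
import re
from collections import Counter

def matches(criterion, cell):
    """Check one cell against a FILTER_ON criterion.

    Criteria starting with a comparison operator (>, >=, <, <=, ==) are
    evaluated numerically; all other criteria are treated as regular
    expressions over the cell's text.
    """
    m = re.match(r"^(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)$", str(criterion))
    if m:
        op, threshold = m.group(1), float(m.group(2))
        ops = {">": float.__gt__, ">=": float.__ge__,
               "<": float.__lt__, "<=": float.__le__, "==": float.__eq__}
        return ops[op](float(cell), threshold)
    return re.fullmatch(str(criterion), str(cell)) is not None

def dominant_domain(column, threshold=0.8):
    """Smallest set of most frequent categories covering at least
    `threshold` of all rows, as used by FILTER_ON_DOMINANT_DOMAIN."""
    kept, covered = set(), 0
    for value, count in Counter(column).most_common():
        kept.add(value)
        covered += count
        if covered / len(column) >= threshold:
            break
    return kept
```

For example, `matches(">=500", "650")` holds, while `matches("Germany|United Kingdom", "France")` does not; a transaction-type column of five deposits, four withdrawals and one rebooking yields the dominant domain {deposit, withdrawal}.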
Annotations for Process Data
Annotation Type Description Examples
PROCESS_VARIABLE Semantic For process data, specifies that a column contains process-level data, i.e., data relating to the abstract definition of a business process.
  • PROCESS_VARIABLE applied to a dimension in a process dataset, e.g., the global process owner, ensures that values in this dimension are analyzed at the abstract process level (as opposed to the instance or event level).
PROCESS_ID Semantic For process data, specifies that a column contains ID values that correspond to a uniquely identifiable, abstract business process.
  • PROCESS_ID applied to a dimension of a process dataset ensures that the values in this dimension are used as the unique process identifier.
INSTANCE_VARIABLE Semantic For process data, specifies that a column contains instance-level data, i.e., data relating to a specific execution instance of a business process.
  • INSTANCE_VARIABLE applied to a dimension in a process dataset, e.g., the client company in an invoice-to-cash process, ensures that this dimension is analyzed at the instance-specific level (as opposed to the process or event level).
INSTANCE_ID Semantic For process data, specifies that a column contains ID values that correspond to a uniquely identifiable, specific execution instance of a business process.
  • INSTANCE_ID applied to a dimension of a process dataset ensures that values in this dimension are used as the unique instance identifier, i.e., as the identifier of a specific instance of the abstract process.
EVENT_VARIABLE Semantic For process data, specifies that a column contains event-level data, i.e., data relating to a specific event as part of the execution instance of a business process.
  • EVENT_VARIABLE applied to a dimension in a process dataset, e.g., the person responsible for the current event / process step in an invoice-to-cash process, ensures that this dimension is analyzed at the event level (as opposed to the process or instance level).
EVENT_ID Semantic For process data, specifies that a column contains ID values that correspond to a uniquely identifiable, specific event as part of an execution instance of a business process. See also EVENT_CATEGORY.
  • EVENT_ID applied to a dimension of a process dataset ensures that values in this dimension are used as the unique event identifier.
NEXT_EVENT_ID Semantic For process data, specifies that a column contains ID values that correspond to the subsequent event within the same instance along a process. See also EVENT_ID.
  • NEXT_EVENT_ID applied to a dimension of a process dataset ensures that values in this dimension are interpreted as references to the EVENT_ID value of the next event within a specific instance.
PREVIOUS_EVENT_ID Semantic For process data, specifies that a column contains ID values that correspond to the prior event within the same instance along a process. See also EVENT_ID.
  • PREVIOUS_EVENT_ID applied to a dimension of a process dataset ensures that values in this dimension are interpreted as references to the EVENT_ID value of the previous event within a specific instance.
EVENT_CATEGORY Semantic For process data, specifies that a column contains categories that allow classifying and aggregating per-event process data. See also EVENT_ID.
  • EVENT_CATEGORY applied to a dimension of a process dataset ensures that values in this dimension are used to categorize the event, e.g., numerically or with an explicit event category such as ‘invoice sent’. Event properties are then aggregated along these categories.
EVENT_START_TIMESTAMP Semantic For process data, specifies that a column contains the start time of an event. See also EVENT_END_TIMESTAMP and EVENT_DURATION.
  • EVENT_START_TIMESTAMP applied to a dimension of a process dataset ensures that date/time values in this dimension are interpreted as the start time of the event, e.g., to calculate event duration if not otherwise given.
EVENT_END_TIMESTAMP Semantic For process data, specifies that a column contains the end time of an event. See also EVENT_START_TIMESTAMP and EVENT_DURATION.
  • EVENT_END_TIMESTAMP applied to a dimension of a process dataset ensures that date/time values in this dimension are interpreted as the end time of the event, e.g., to calculate event duration if not otherwise given.
EVENT_DURATION Semantic For process data, specifies that a column contains the duration of an event. See also EVENT_START_TIMESTAMP and EVENT_END_TIMESTAMP.
  • EVENT_DURATION applied to a dimension of a process dataset ensures that numeric values in this dimension are interpreted as the duration of the event, e.g., to calculate overall and aggregate processing times for the process.
EVENT_RESULT Semantic For process data, specifies that a column describes the result state of an event.
  • EVENT_RESULT applied to a dimension of a process dataset ensures that the values are interpreted to signify the end state of the event, e.g., success or failure, to analyze a process for likely sources of stalling and failures.
EVENT_OWNER Semantic For process data, specifies that a column contains owner information relating to an event.
  • EVENT_OWNER applied to a dimension of a process dataset, e.g., the column with the responsible person, team or department, ensures that event/case/process metrics can be appropriately allocated within the organization.
Annotations for Surveys / Opinion Polling
Annotation Type Description Examples
DEFINE_AS_MISSING Transformation Specifies a value for this column that is to be treated as missing, i.e., no data available, during survey analysis.
  • DEFINE_AS_MISSING(-1, "Not applicable") applied to a column of opinion polling answers inserts the phrase “Not applicable” into all cells with a value of -1; these cells are then treated as empty. The defined phrase is displayed appropriately in the analysis results
DEFINE_AS_NO_OPINION Transformation Specifies a value for this column that is to be treated as ‘no opinion’, i.e., respondent decided to provide no answer, during survey analysis.
  • DEFINE_AS_NO_OPINION(999, "Don’t Know") applied to a column of opinion polling answers inserts the phrase “Don’t Know” into all cells with a value of 999; the numeric value 999 is excluded from sum and average operations. The defined phrase is displayed appropriately in the analysis results
DEMOGRAPHIC_VARIABLE Semantic For survey analytics, specifies that a column contains socio-demographic information which can be used in contingency tables and cross tabulations. Together with SURVEY_RESPONSE and SURVEY_META, this annotation classifies all dimensions contained in a typical survey dataset. See also SURVEY_RESPONSE and SURVEY_META.
  • DEMOGRAPHIC_VARIABLE applied to a dimension of household income in a survey ensures that this household income is analyzed as a socio-demographic variable (not a survey response)
MULTIPLE_RESPONSE_VARIABLE Semantic In survey analytics, if responses to multiple choice questions are stored across one dimension per response option, this annotation specifies which dimensions jointly encode a multiple choice variable. The annotation takes the name of a multiple choice variable (or any other string) as parameter, and this parameter needs to be identical across all dimensions belonging to the multiple choice variable.
  • MULTIPLE_RESPONSE_VARIABLE("Q1") applied to a number of columns in a survey dataset ensures that these columns are collectively interpreted and analyzed as the options of one multiple choice variable
MULTI_PUNCH_VARIABLE Semantic Synonymous with MULTIPLE_RESPONSE_VARIABLE. (as above)
SURVEY_CASE_ID Semantic In survey analytics, specifies that this column stores ID values that uniquely identify the case / interview / questionnaire / respondent.
  • SURVEY_CASE_ID applied to a column in a survey dataset ensures that the values in this column are used when explaining any anomalies, outliers or other patterns in the data. If no case ID is specified, the line number is used instead
SURVEY_DURATION Semantic In survey analytics, this annotation specifies that this dimension holds the duration that a participant used to complete the interview or questionnaire. Adding this annotation enables speeder detection as part of the survey quality assessment.
  • SURVEY_DURATION applied to a numeric dimension causes this dimension to be treated as a survey duration indicator
SURVEY_INTERVIEWER Semantic In survey analytics, specifies that this column stores values that uniquely identify the interviewer who conducted the interview with a participant. Adding this annotation enables checks for interviewer bias as part of the survey quality assessment.
  • SURVEY_INTERVIEWER applied to a dimension causes that dimension to be treated as the interviewer ID for a survey dataset
SURVEY_META Semantic In survey analytics, this annotation should be applied to any dimension that holds meta-information about the survey to be analyzed, e.g., internal identifiers, organizational data or instructions to the interviewer or interview software. Together with DEMOGRAPHIC_VARIABLE and SURVEY_RESPONSE, this annotation classifies all dimensions contained in a typical survey dataset with the effect that no warnings are shown regarding any unclassified dimensions. See also DEMOGRAPHIC_VARIABLE and SURVEY_RESPONSE.
  • SURVEY_META applied to a dimension containing information about the web browser that was used to fill out a survey causes that dimension to be treated as survey meta-information, thus removing any warnings about this dimension being unclassified as neither socio-demographic nor response variable
SURVEY_MODE Semantic In survey analytics, specifies that this column stores values that uniquely identify the survey mode as part of which this case / interview was collected.
  • SURVEY_MODE applied to the column in a survey table that specifies how an interview was conducted, e.g., in-person or online, marks that column as the survey-mode identifier
SURVEY_RESPONSE Semantic In survey analytics, this annotation specifies that a dimension contains survey response values which can be used in contingency tables and cross tabulations. Together with DEMOGRAPHIC_VARIABLE and SURVEY_META, this annotation classifies all dimensions contained in a typical survey dataset. See also DEMOGRAPHIC_VARIABLE and SURVEY_META.
  • SURVEY_RESPONSE applied to a dimension of household income in a survey ensures that this household income is analyzed as a survey response (not a socio-demographic variable)
SURVEY_WAVE Semantic In survey analytics, specifies that this column stores values that uniquely identify the survey wave as part of which this case / interview was collected. Adding this annotation enables analysis of how responses have changed over time.
  • SURVEY_WAVE applied to the column in a survey table that contains the dates or other identifying information for the wave an interview belongs to marks that column as the wave identifier. For waves identified by date values, this annotation results in time-series analyses over these dates
Miscellaneous Annotations
Annotation Type Description Examples
ANONYMIZE Transformation Anonymize all values in column with a securely generated ID value, utilizing a cryptographically strong one-way hash function. A look-up table to map hashed values to original values is made available separately to the user who owns the analysis.
  • ANONYMIZE applied to a column of employee names replaces the names with securely anonymized hash values in all analyses
OVERRIDE_RESTRICTIONS Analysis Annotates a dimension to be analyzed without any restrictions that would usually be in place to ensure acceptable runtime when analyzing very large tables. Use with caution!
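The ANONYMIZE behavior can be sketched as follows. This is a simplified illustration using a salted SHA-256 digest; the product's actual hash construction and look-up table delivery are not documented here:

```python
import hashlib
import secrets

def anonymize(column):
    """Replace values with salted one-way hashes; return data + look-up table.

    A per-analysis random salt hardens the one-way hash against dictionary
    attacks; the look-up table is shared only with the analysis owner.
    """
    salt = secrets.token_hex(16)
    lookup, hashed_column = {}, []
    for value in column:
        digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]
        lookup[digest] = value
        hashed_column.append(digest)
    return hashed_column, lookup
```

Identical input values map to the same hash within one analysis, so aggregations by the anonymized column still work, while the original names never appear in results.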

These annotations are compliant with DIN SPEC 32792.

Advanced users may also prefer to embed these annotations directly in their data, by appending them to column labels enclosed in curly brackets, e.g., {SUMMABLE}.
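A label such as `Price {SUMMABLE}` could be split into a column name and its embedded annotations like this (a sketch; the handling of multiple annotations in one label is our assumption):

```python
import re

def parse_column_label(label):
    """Split a column label into its name and embedded annotations.

    Annotations are appended to the label in curly brackets,
    e.g. 'Price {SUMMABLE}'.
    """
    annotations = re.findall(r"\{([^}]+)\}", label)
    name = re.sub(r"\s*\{[^}]+\}", "", label).strip()
    return name, annotations
```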

Simple Analysis Guidance (Survey Analysis)

With the Analysis Guidance described above, users have fine-grained control over the performed analysis. Exercising this much control, however, can be time-consuming and is not necessary for every user.

Inspirient therefore also offers a simpler guidance mode that is tailored specifically to survey analysis. In most cases, it gives users enough control to guide the analysis with minimal effort.

Initially, the user is shown the system's classification of the input data's columns. The user may then adjust this classification as needed, using either the multi-select or the drop-down menus and the corresponding buttons.

The Simple Analysis Guidance (Survey Analysis) is shown automatically for all survey analyses. If more fine-grained control is desired, clicking “Switch to advanced view” takes the user to the full Analysis Guidance.

Simple Analysis Guidance (Survey Analysis)
The Simple Analysis Guidance (Survey Analysis) lets users guide the survey analysis with minimal effort

Best Practices

  • Prioritize sparingly, but with confidence – In most cases, it is not necessary to fine-tune the priorities of every dimension of a dataset. It’s more time-efficient to quickly adjust the priorities of the most important dimensions, and then later use tags to filter out less important results.
  • Annotate selectively – Annotations help the system to correctly handle the dimensions of a dataset in all corner cases. This means that in most cases the correct analytical methods will be applied even without annotations. If pressed for time, some users may even do a quick initial run with the I’m feeling lucky button, check key results for issues, and add only the annotations required to address these issues.
  • Re-use prior priorities and annotations – The priorities and annotations of all past analyses are scanned to make the best possible suggestions for the current dataset. This includes datasets from other users (with accounts on the same Inspirient service instance). Suggested priorities and annotations may thus reflect what your co-workers found appropriate for data like yours.