# Predictive Analytics

### DO YOU HAVE CONFIDENCE IN YOUR PREDICTIONS?

We’ve all been there. We make a numerical prediction and then we ask ourselves, Is my prediction correct? The answer to this question depends on many factors. How will my prediction be used, and by whom? Does it matter if I am not 100% “correct”? What happens if I am 20% or 30% off? Am I modeling an inherently nonlinear or stochastic system, where small variations in input parameters lead to vastly different answers? Does my input or comparison data come from heterogeneous sources with different levels of fidelity or uncertainty? The answers to these questions are all important in deciding which tools to use to assess the goodness of a model.

Our Predictive Analytics solution is composed of three elements:

1. MODEL VALIDATION: Addresses the question, Is my model statistically sound at a given confidence level?
2. UNCERTAINTY QUANTIFICATION & PROPAGATION: Incorporates the effects of uncertainty into the prediction.
3. INFORMATION FUSION: Incorporates into the prediction the effect of having data sources with different levels of fidelity, sparsity, etc.

### MODEL VALIDATION

Let’s introduce two important concepts when validating a model:

• Statistical Significance: Answers the question, Are two or more data sets statistically similar at a given significance level? Take the graph to the right: the red curve is a numerical prediction and the other curves are several experiments. As seen from the graph, the red curve follows the same trend but is generally outside the “cloud” of experimental values. If we ran a statistical test, we would probably determine that this prediction does not “match” the experiments, because it is outside a certain variance band. Whether a difference is judged statistically significant depends on the power of the test and the significance level (the error rate we are willing to accept). Generally, the power of the test increases as the variance of the data decreases and as the amount of data increases.
• Practical Significance: Answers the question, Does it matter? Looking at the same graph on the right, one may envision a use case where we only need the red curve to follow the trend of the experimental “cloud” and do not care whether it lies within it. For this use case, even though the two data sets are statistically significantly different, the difference is not practically significant.
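The distinction between the two concepts can be illustrated with a minimal sketch. It is not our production method: it assumes a single predicted value compared against repeated measurements, uses a two-sided z-test (a normal approximation that is only reasonable for larger samples), and judges practical significance with a hypothetical relative tolerance. The function name `compare_to_reference` and the parameters `alpha` and `tolerance` are illustrative assumptions.

```python
import math
from statistics import NormalDist, mean, stdev


def compare_to_reference(prediction, experiments, alpha=0.05, tolerance=0.10):
    """Return (statistically_different, practically_different).

    Assumptions (illustrative only): a two-sided z-test on the experimental
    mean, and a relative tolerance chosen by the user for the use case.
    """
    n = len(experiments)
    exp_mean = mean(experiments)
    std_err = stdev(experiments) / math.sqrt(n)

    # Statistical significance: is the prediction far from the experimental
    # mean relative to the standard error of that mean?
    z = (prediction - exp_mean) / std_err
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    statistically_different = p_value < alpha

    # Practical significance: is the relative error larger than what the
    # use case can tolerate?
    rel_error = abs(prediction - exp_mean) / abs(exp_mean)
    practically_different = rel_error > tolerance

    return statistically_different, practically_different


# 50 synthetic measurements clustered tightly around 100 (made-up data).
experiments = [100.0 + 0.5 * ((i % 5) - 2) for i in range(50)]

# A prediction that is 3% high: outside the tight experimental scatter,
# yet inside a 10% tolerance.
result = compare_to_reference(103.0, experiments)
print(result)
```

With these made-up numbers the prediction is flagged as statistically different but not practically different, which mirrors the red-curve use case described above.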

To determine whether two data sets, the candidate (typically the numerical prediction) and the reference (typically experiments or field data), are statistically the same, we use regression analysis and hypothesis testing. Our advanced regression analysis techniques, using an optimized basis or shape function set (polynomial, exponential, sinusoidal), extract the best-fit signature of the candidate and reference data sets. We then employ advanced univariate and multivariate statistical methods for hypothesis testing to compare the candidate and reference sets at a given confidence level. There are three cases one can envision when comparing numerical predictions against experiments or field data:

• One-to-One: You have one prediction curve that you are trying to compare against one experiment or field data curve.
• One-to-Many: You have one prediction curve that you are trying to compare against several experiment or field data curves.
• Many-to-Many: You have many prediction curves that you are trying to compare against several experiment or field data curves.
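As a rough illustration of the One-to-Many case, the sketch below checks how often a single prediction curve falls inside a pointwise band built from several experimental curves. This is a simple screening check under stated assumptions, not the regression-based hypothesis testing described above: all curves are assumed to be sampled at the same points, and the band half-width `k` standard deviations is a hypothetical choice. The name `fraction_inside_band` is illustrative.

```python
from statistics import mean, stdev


def fraction_inside_band(prediction, experiments, k=2.0):
    """Fraction of sample points where the prediction lies within
    mean +/- k * standard deviation of the experimental curves.

    prediction  : list of floats, one value per sample point
    experiments : list of curves, each a list with the same length
    """
    inside = 0
    for i, y_pred in enumerate(prediction):
        samples = [curve[i] for curve in experiments]
        centre = mean(samples)
        half_width = k * stdev(samples)
        if abs(y_pred - centre) <= half_width:
            inside += 1
    return inside / len(prediction)


# Three made-up experimental curves and one made-up prediction curve.
experiments = [
    [1.0, 2.0, 3.0, 4.0],
    [1.2, 2.2, 3.2, 4.2],
    [0.8, 1.8, 2.8, 3.8],
]
prediction = [1.1, 2.1, 3.9, 4.9]

print(fraction_inside_band(prediction, experiments))  # 0.5
```

Here the prediction tracks the experiments at the first two points and drifts out of the band at the last two, so half of the points fall inside it.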

For each of these cases, we offer advanced model validation techniques to determine whether your model is valid. To learn more, download our joint presentation with Caterpillar from the ASME Verification & Validation Symposium, May 2016.

### UNCERTAINTY QUANTIFICATION & PROPAGATION

Here we quantify the uncertainty of the parameters and propagate this uncertainty into the prediction. There are two types of uncertainty:

Aleatory Uncertainty is the natural variation of input parameters that impacts outputs of interest. Because this variability is inherent to the physical phenomena being modeled, it is irreducible.

Epistemic Uncertainty, on the other hand, is the uncertainty due to a lack of knowledge, which means that it can be reduced by additional information. This uncertainty can be further separated into model uncertainty (e.g. parameter uncertainty, approximation errors, etc.) and data uncertainty (e.g. measurement uncertainty, sparse data, etc.).

As much as possible, a good modeler tries to reduce epistemic uncertainty and to characterize both types of uncertainty. We then apply the following process:

1. Uncertainty Characterization – Characterize the system uncertainty (typically probabilistically).
2. Model Verification – Verify that the model is implemented correctly and that its solution is verified.
3. Model Calibration – Update the model based on the verification.
4. Model Validation – Determine whether the model results are valid.
5. Uncertainty Propagation for Prediction – Propagate the uncertainty through the model into the prediction.
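The final step, propagating uncertainty into the prediction, can be sketched with plain Monte Carlo sampling. Everything specific here is an assumption made for illustration: the toy model (deflection = load / stiffness), the normal distribution standing in for aleatory variation of the load, and the uniform interval standing in for epistemic uncertainty about the stiffness. Monte Carlo is only one of several propagation techniques.

```python
import random
from statistics import mean, quantiles


def propagate(n_samples=10000, seed=0):
    """Propagate input uncertainty through a toy model by Monte Carlo.

    Returns (mean, 2.5th percentile, 97.5th percentile) of the output.
    The model and all distributions are hypothetical.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    outputs = []
    for _ in range(n_samples):
        # Aleatory: natural variation of the applied load.
        load = rng.gauss(1000.0, 50.0)
        # Epistemic: stiffness known only to lie within an interval.
        stiffness = rng.uniform(180.0, 220.0)
        outputs.append(load / stiffness)

    # n=40 yields 39 cut points spaced 2.5% apart, so the first and last
    # cut points bound the central 95% of the sampled outputs.
    cuts = quantiles(outputs, n=40)
    return mean(outputs), cuts[0], cuts[-1]


pred_mean, lower, upper = propagate()
print(f"prediction: {pred_mean:.2f} (95% interval {lower:.2f} to {upper:.2f})")
```

The output is reported as a central value with an interval rather than a single number, which is the practical result of propagating the characterized uncertainty.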

### INFORMATION FUSION

Many times when making a prediction, one needs to combine several data sources such as simulation data, test data, expert opinion, other mathematical models, field data, legacy data, reliability data, etc. Each of these heterogeneous data types has a different level of uncertainty and knowledge associated with it.

A methodology to tackle this type of problem is called Bayesian Information Fusion or Bayesian Uncertainty Integration. This method uses Bayesian Networks to propagate uncertainty through the network nodes. The process involves four steps:

1. Model Calibration: Involves developing the likelihood function, updating the distribution function, constructing the posterior PDF and building a Gaussian Process surrogate model.
2. Model Validation: Involves Bayesian hypothesis testing, a model reliability metric and model error quantification.
3. Integration of Calibration & Validation: Involves a weighted average of model parameters and propagation of the calibration uncertainty.
4. Improved Prediction: Involves propagating integrated distributions and computing predictions with quantified uncertainty.
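The core idea of weighting sources by their uncertainty can be shown with the simplest possible Bayesian update. This sketch is a deliberate simplification and not a Bayesian Network: it assumes a single scalar quantity, a Gaussian prior (for example from simulation) and independent Gaussian observations with known variances. The function name `fuse_gaussian` and all numbers are illustrative assumptions.

```python
def fuse_gaussian(prior_mean, prior_var, observations):
    """Conjugate normal-normal update for one scalar quantity.

    observations: list of (value, variance) pairs, one per data source.
    Returns (posterior_mean, posterior_variance).
    Each source is weighted by its precision (1 / variance), so more
    certain sources pull the estimate harder.
    """
    precision = 1.0 / prior_var
    weighted_sum = prior_mean / prior_var
    for value, variance in observations:
        precision += 1.0 / variance
        weighted_sum += value / variance
    posterior_var = 1.0 / precision
    return weighted_sum * posterior_var, posterior_var


# Hypothetical sources: simulation as the prior, then test data
# (low variance) and expert opinion (high variance).
post_mean, post_var = fuse_gaussian(
    prior_mean=10.0,
    prior_var=4.0,
    observations=[(12.0, 1.0), (11.0, 4.0)],
)
print(post_mean, post_var)
```

With these made-up inputs the fused estimate is 11.5, closest to the low-variance test data, and the posterior variance (about 0.67) is smaller than that of any single source.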