Category Archives: machine-learning

CRP = CR*P?

Procrastinating on a hard problem, I decided to take a brief diversion to look at Constant Rebalanced Portfolios and Universal Portfolios, lured by Max Dama’s post on UP (Universal Portfolios).   I had read papers on these in the past but never explored them empirically.

CRP is the underpinning of Universal Portfolios, so will focus on CRPs.   Simply stated the CRP approach allocates a fixed % of capital to each asset in the portfolio.   The portfolio is rebalanced each period as some assets will have disproportionally increased or decreased in value.   As Max points out, this is basically a mean-reversion scheme.

Unfortunately, this is a “blind” mean-reversion scheme in the sense that there is no measure of the likelihood of mean-reversion or the period over which it will take place.   The implicit assumption is that mean-reversion will occur between rebalancing periods.    The more worrying aspect is that money in “winning” assets will be diverted to “losing” assets (where by losing, refer to assets that trend downward with little MR in the upward direction).

Examples
The classic (and absurd) example where CRP does phenomenally well, is of a pair of assets where one asset is constant and the other asset appreciates and depreciates on alternating periods (in this case up 25% and down 25% repeatedly).   Needless to say, this provides exponential growth (I could only plot the first 100 days without obscuring the other detail):

Empirical Tests
I did not find that CRP did much better than the equivalent Buy & Hold portfolio with the same weightings.  Indeed, depending on transaction costs, could do significantly worse.     There are some asset sets, undoubtedly, that would do much better than Buy & Hold over specific periods, but would be few and far between given the fragility of CRP assumptions.

Making the Concept Work
The CRP concept is one of moving money away from assets we expect will mean-revert (in the negative direction) and increasing money in assets that will mean-revert (in the positive direction).   This is done in a rigid fashion and makes no observation as to whether a devaluing asset is likely to mean-revert in the next period.

It had occurred to me that modifying the rebalancing to take into account the likelihood of mean-reversion or more generally, the likelihood of appreciation or depreciation in the next period would be a better guide in rebalancing the portfolio.    Of course, the performance of such a scheme depends on the degree of accuracy of such measures.

Not surprisingly, there has been work in this area, for instance the ANTICOR algorithm described by (Borodin, El-Taniv, and Gogan) in “Can We Learn to Beat the Best Stock”.    Their approach is to use the autocorrelation and cross-correlation of prior periods to adjust the portfolio weighting in a CRP portfolio.

The fragility in this approach is two-fold:

  1. relies on fixed windows as a means to determine correlation or anti-correlation
    I would expect that mean-reversion cycle periods would differ for different assets.   That said, they need a way to compare across assets, so may be a reasonable compromise.
  2. correlation is a coarse measure
    There are other measures that may be more effective in determining future mean-reversion or direction.

That said, the approach is parsimonious and seems to have performed quite well in the empirical tests.

Pattern / Sequence Learning Approaches
There have been a number of papers describing approaches where the choice of weightings for the portfolio in the next period is determined based on finding prior patterns that match the current local pattern in past data.   I.e. the prior K returns are converted to a series of symbols and compared to historical sequences.   One attempts to locate the approximate sequence one or more more times in past history and observe the optimal portfolio weights that maximized cumulative return in the past.    This is repeated with varying length and discretization “fuzziness”.

The observed weights are blended based on the performance of the past weightings and degree of match with the current sequence.    The authors point to excellent results on empirical data.   It is surprising that there is a reasonable amount of information in the daily returns, I had thought would be more dominated by noise.

One Final Note
Although I have been “disparaging”  CRP, it can be shown that some weighting of CRP will yield the most optimal portfolio provided returns are i.i.d.   There lies the rub.

Leave a comment

Filed under machine-learning, portfolio

Advances in Computing

Have been watching advances in computing power for some time, since university really.   I did research in parallel algorithms and architecture for my first position at the university and later applied practically on Wall Street.   In those days super-expensive machines like the Intel Hypercube, Paragon, and many other architectures were the backbone of the HPC community.

HPC (High Performance Computing) roughly breaks down into 4 categories:

  1. Big iron supercomputers (MIMD generally)
  2. Distributed computing (these days advertised as Cloud Computing)
  3. The emerging SIMD GPU based solutions
  4. Quantum Computing (not really here yet for the mainstream)

In the machine learning and optimisation world there are massive problems, some of which are not computable on von-neumann architectures, as their runtime would be astronomical.    An (absurd) example of such a problem would be to simulate a large number monkeys typing on typewriters, stopping when one produces the works of Shakespeare.   The number of monkeys required to produce such a work on average in astronomical.     This seems like an absurd problem, but is comparable to the GP / GA approach.

Then of course there are numerous problems with high dimensionality and/or with polynomial order complexity.

Supercomputing on the Cheap
The FASTRA team at the University of Antwerp has put together an inexpensive multi-teraflop machine with 7 gaming cards.  Check out their video.

Unfortunately the “easy” part of these sort of solutions is the hardware.  The problem is the (often) great expense to develop one’s models in a SIMD framework, so can be applied to for the GPU architecture.    Although there is now standardization on the low-level C-variant used to program GPUs, there are significant differences between different models of GPUs, that even if you manage to write a correct SIMD program, may have to rearrange for a specific GPU implementation.   (I guess this is not all that different from my experiences with big-iron parallel architectures of the past).

One could have a team devoted to parallelization, tuning, and retuning / reworking for the new GPUs that are out periodically.   Very time consuming!

For my work, the problems that would map well are particle filters and monte-carlo based models, each of which have obvious fine-grained parallel operations.

Quantum Computing
The other notable announcement this week was Google’s use of quantum computing to solve pattern recognition problems.   I have not done the leg-work to fully understand the algorithms in quantum computing, but broadly it seems to be a matter of framing one’s problems statistically as path integration problems (i.e., expectations), where quantum computing allows the paths to be explored simultaneously.


5 Comments

Filed under HPC, machine-learning

Learning a Sequence

I had been looking at predicting durations (or the intensity) to model price behavior and variance estimation.    As mentioned previously, the prevalent ACD models in the literature do poorly.   Before moving on to another topic wanted to revisit this, with an idea for future approach.

Here is a sample of durations for a high-frequency price series:

9.30, 0.26, 0.28, 4.21, 0.04, 0.21, 3.23, 0.04, 2.28, ...

I decided that rather than trying to regress for specific durations, where there are an infinite number of possible values (theoretically), transform this into a set of symbols so that there are a finite number of states say:

S1, S2, S3, ...

where S1 might represent durations in [0, 0.25], S8 durations in [3, 3.5], etc.   The sequence of states for the above durations might look like:

12 → 1 → 1 → 9 → 1 → 1 → 8 → 1 → 7 → 6 → 6 → 6 ...

This turned out to be useful.

SVM
SVM on a radial basis kernel did a much better job of predicting the next symbol (duration) in a sequence than the ACD models.   It was still not  a suitable level of prediction however.

The problem with SVM and related approaches in general is that you either need to have a problem that can easily be categorized in high dimensional linear vector space.  A big part of this is finding the kernel that will map your (usually) non-linear vectors into a linearly separable space.    Also, SVM is arguably better suited to binary classification as opposed to multinary classification.

ANNs
In theory, an ANN with enough neurons can asymptotically approximate any function.  There are many problems in arriving at a general solution though:

  1. Calibration
    Standard techniques of backpropagation (essentially gradient descent) solve for a local optimum, which depends on the starting configuration.   A global optimum can be found with meta-heuristic approaches such as GAs, however, at significant computational cost.
  2. Overfitting
    It is very difficult to come up with networks that generalize.   Part of the success in doing this involves choosing training sets and configurations carefully.

Nevertheless, this may be an approach worth exploring.

Probabalistic Graph Models
As our duration pattern is essentially a transition from one state to the next, modeling as a probabalistic finite state machine appeals as model.  The idea with such an approach would be:

  1. empirically observe all chains of length ≤ some maximum
  2. determine the frequency of chains
  3. factorize into the smallest graph that reproduces those chains within some error

The chains, for instance:

A first approach to this problem is to consider whether can be modeled as a markovian state system.  It is, however, doubtful that the states {S1, S2, S3, …} can be modeled in a strictly markovian setting without the use of additional states.

For instance, is  P(S1|S2) the same as P(S1|S2, {prior states})?   The duration data shows dependence beyond the immediate prior state.    Therefore we have to expect that P(S1|S2, {S5,S1}) will differ from P(S1|S2, {S2,S3}), whereas in a markovian model, the probability of S1 can be conditioned purely on the prior state.

Such a markovian system might look like:

The HMM (Hidden Markov Model) combats this assumption by assuming that there is a hidden markovian process (usually with more states than the observed state system).   One can easily prove that a HMM of infinite size can exactly model all possible state chains (sequences) amongst a finite set of states.   Of course we are interested in a much smaller model that can reproduce most of the observed chains with limited error.

Here is a sample structure, where the black lines are edges between hidden states and the red edges indicate correspondence between hidden state and observed state.   The red edges are not traversed:

Aliasing Issues
Remember that we have arbitrarily subdivided durations (which are continuous) into N discrete states.   The idea was that the difference between say 0.25 seconds and 0.22 seconds is not important for our purposes.   One would think that less granular states will allow for  easier modeling of the state sequence.

The problem is that we are dividing these discretely.   We run into an aliasing problem where a specific duration partially belongs to the set represented by S(i) and S(i+1).   For instance for a sequence of length 3 we have 4 possible true state paths, each with associated probability.   Without compensating for aliasing we see the states (naively):

With aliasing we have the following possibilities:

As our path length approaches N, we will have 2^(N-1) possible paths.  One possible implementation of this is train with the M highest probability paths.

Fuzzy HMM
Aliasing is a kind of fuzzy set membership.   Aside from aliasing there are a number of reasons why we should consider fuzzy state membership:

  1. The data may be noisy, obscuring the pattern
  2. Discretisation error (aliasing)

Not surprisingly, other people have thought of fuzzy state membership in the context of HMM.   There are multiple fuzzy HMM models.   To be investigated …

Leave a comment

Filed under machine-learning, volatility

Duration Estimation

In a prior post mentioned that for intra-day variance prediction it made sense to separate variance into 2 processes:

  1. intensity process
    When is the next event going to occur;  lets call this Tprior + Δt.   This is the more complex process of the two to predict.
  2. power process
    What is the amplitude of the event at time Tnow + Δt.   The power or amplitude process seems to be fairly well behaved.   An ARMA style process seems like a likely candidate.

Towards this end, I have been exploring models for the intensity process.   Very often this is modeled in terms of duration.   Below is a summary of some results:

ACD Models
ACD processes make overreaching assumptions.  In particular ACD models assume a constant AR decay and innovation contribution across time.   Unfortunately this is not supported by empirical observations.   Here are some results for the best-fitting Wiebull ACD model on HF data:

The R^2 level of 0.0091 does not inspire confidence.

SVR Model
I used an iterative non-parametric machine learning approach (SVR) with a training set of 20 prior observations and a lagged series of the derivatives of the prior 20 durations as the input vector.   Training across the entire series, one gets an in-sample prediction R^2 of 0.9980.   Unfortunately, incremental out of sample does not fair as well:

Distribution of Durations
Here are 2 views on the distribution of durations:

Alternative Models
Some possibilities:

  1. markov chain (probabalistic state system)
    We model the patterns by categorizing the durations into K separate levels.   To train we observe the chain of states, say {K1, K8, K1,K1,K1,K4} and determine a graph describing the approximate event chains, factorizing and assigning probabilities.
  2. ANN
    Use a simple feed-forward network, trained with a GA or DE.   This is easy to implement but subject to a variety of problems such as overfitting.

As the ANN is easy to compose, will start there.

1 Comment

Filed under genetic programming, machine-learning, neural networks, statistics, volatility

Regression Approaches

One of the readers of this blog (skauf) suggested looking into KRLS (Kernel Recursive Least Squares), which is an “online” algorithm for multivariate regression closely related to SVM and Gaussian Processes approaches.    What struck me as clever in the KRLS algorithm is that it incorporates an online sparsification approach.  Why this is important, I will attempt to explain briefly in the next section.

The sparsification approach is similar in effect to PCA in that it reduces dimension and determines the most influential aspects of your system.  I had been thinking about a combined PCA / basis decomposition approach for some time, and it struck me that KRLS might be the way to do it.

Kernal-based Learning Algorithms
KRLS and SVM algorithms use a common approach to classification (or regression).   In the simplest case, let us assume that we want to find some function f(x) that will classify n-dimensional vectors as belonging to one of 2 sets (apples or oranges), which we will distinguish by + and – values from f(x):

Picture 1

Assuming the vectors X are linearly separable, we could try to find a hyperplane on the n-dimensional space that would separate the vectors such that there is a balance of distance between the two categories (and one that is maximal subject to some constraints).   The distance between the hyperplane (or in 2 dimensions a line) and the points is determined by a margin function.

Picture 2

Optimization is accomplished by maximizing the margins subject to constraints.  The optimization solves for the weights α  such that the sum of the weighted dot product of  X with each Xi yields the predicted value.  The form of f(x) is the following:

Picture 3

Now the above graph shows a very well-behaved data set, with a clear boundary for classification.  Many data sets, however, will have significant noise in the data and/or outliers that refuse to be categorized correctly.   This data has an overall impact on the regression line.

Supposing we have N samples in our training set {{X1,Y1}, {X2,Y2}, … {Xn, Yn}}.   In simple terms, sparsification is the process of removing (or discounting) samples that are already represented in the existing set, reducing the bias of the regressor.   One approach to determining the degree of representation is to observe the “degree of orthogonality” represented by a given vector with respect to the existing set.   Sparsification also points to an approach to evolving the regressor over time relative to new observations.

The Kernel
Above gave a brief overview of how SVM uses the notion of margin to classify data.  A kernel is not necessary for data that is linearly separable.   The kernel is “simply” a function that maps data from the “attribute” space to “feature” space.   We design or choose the kernel function so that our data is largely linearly separable in the “feature” space and with the constraint that the covariance matrix of all possible vectors mapped from attribute to feature space will be positive semi-definite.

Since our linear SVM equations are expressed in terms of inner products, given a feature mapping function Φ(x) mapping X from non-linear space to “linearly separable space”, we can express the kernel as a function of the inner product of two vectors, later to plug in to our linear equations:

Picture 5

The trick is to choose a kernel function that maximizes dispersion of the data into sets that are linearly separable.

How does this relate to an explicit probability based regression?
It turns out that the process of optimizing the margin function on a Gaussian kernel  is equivalent to finding the unnormalized maximum likelihood, whereas the Gaussian Process approach makes this explicit.

Which model does one choose?
The Gaussian Process approach is strictly bayesian.   It has the upside of providing explicit probabilities / confidence measures on predictions.  If one knows the likelihoods of the desired labels with respect to the attribute vectors, the model works very well and provides more information than the SVM family.

The SVM family of approaches use a margin function to determine the similarity between vectors (for the purpose of classification).   This does not explicitly involve probabilities, but for the Gaussian kernel can be shown to be equivalent to the Gaussian Process approach.   The SVM family has the upside that one does not need to know the likelihoods.

Finally, both models allow the use of kernels to map data from a non-normal or non-linear space to a linear or gaussian space.   Some have shown that there is a degree of equivalence between the likelihood function in the Gaussian Process algorithm and the kernel function used in SVM.

Leave a comment

Filed under AI, machine-learning, statistics

Decision Tree Learning

Some time ago began using bayesian networks (decision trees with conditional relationships) to boost signal for strategies, the idea being that a combination of related observations can be combined conditionally to provide a posterior conclusion with a higher degree of confidence.

In the context of trading our bayesian network tells us:

  1. given the coincident set of events should we trade and if so, in what direction
  2. with what degree of confidence

My approach up until now had been to carefully construct the relationships.  It requires painstaking research and a lot of trial and error.   Why not use an algorithm to assemble / find relationships, and classify data.

General Approaches
The more general decision tree algorithms allow me to present many factors (even with high dimension), determine which factors have the highest degree of information relative to our classification targets (buy, sell, don’t-trade), and formulate a classification or regression that models this.

Key to the assembly on the tree are measures taken from information theory.  We want to make arrangements in the tree such that the “entropy” of the tree is minimized or in other words information is maximized.   This is often calculated with a discrete form of the Kullback-Leibler divergence metric.

Algorithm
In particular the “Random Forest” approach is quite appealing.   Multiple decision trees are constructed against training subsets drawn randomly from a larger training set.   These ultimately produce many variations on the “true” model.   The approximation to the true model is made by observing the mode (similar to a robust mean / expectation) across the random trees.

Taking advantage of the classification ability of the algorithm is going to allow me to try many new inputs in my strategies without the huge research overhead and without the near-intractable multivariate optimization required in other approaches.   I’ll post some results on this soon.

Leave a comment

Filed under machine-learning