CRP = CR*P?

Procrastinating on a hard problem, I decided to take a brief diversion to look at Constant Rebalanced Portfolios and Universal Portfolios, lured by Max Dama’s post on UP (Universal Portfolios).   I had read papers on these in the past but never explored them empirically.

CRP is the underpinning of Universal Portfolios, so will focus on CRPs.   Simply stated the CRP approach allocates a fixed % of capital to each asset in the portfolio.   The portfolio is rebalanced each period as some assets will have disproportionally increased or decreased in value.   As Max points out, this is basically a mean-reversion scheme.

Unfortunately, this is a “blind” mean-reversion scheme in the sense that there is no measure of the likelihood of mean-reversion or the period over which it will take place.   The implicit assumption is that mean-reversion will occur between rebalancing periods.    The more worrying aspect is that money in “winning” assets will be diverted to “losing” assets (where by losing, refer to assets that trend downward with little MR in the upward direction).

Examples
The classic (and absurd) example where CRP does phenomenally well, is of a pair of assets where one asset is constant and the other asset appreciates and depreciates on alternating periods (in this case up 25% and down 25% repeatedly).   Needless to say, this provides exponential growth (I could only plot the first 100 days without obscuring the other detail):

Empirical Tests
I did not find that CRP did much better than the equivalent Buy & Hold portfolio with the same weightings.  Indeed, depending on transaction costs, could do significantly worse.     There are some asset sets, undoubtedly, that would do much better than Buy & Hold over specific periods, but would be few and far between given the fragility of CRP assumptions.

Making the Concept Work
The CRP concept is one of moving money away from assets we expect will mean-revert (in the negative direction) and increasing money in assets that will mean-revert (in the positive direction).   This is done in a rigid fashion and makes no observation as to whether a devaluing asset is likely to mean-revert in the next period.

It had occurred to me that modifying the rebalancing to take into account the likelihood of mean-reversion or more generally, the likelihood of appreciation or depreciation in the next period would be a better guide in rebalancing the portfolio.    Of course, the performance of such a scheme depends on the degree of accuracy of such measures.

Not surprisingly, there has been work in this area, for instance the ANTICOR algorithm described by (Borodin, El-Taniv, and Gogan) in “Can We Learn to Beat the Best Stock”.    Their approach is to use the autocorrelation and cross-correlation of prior periods to adjust the portfolio weighting in a CRP portfolio.

The fragility in this approach is two-fold:

  1. relies on fixed windows as a means to determine correlation or anti-correlation
    I would expect that mean-reversion cycle periods would differ for different assets.   That said, they need a way to compare across assets, so may be a reasonable compromise.
  2. correlation is a coarse measure
    There are other measures that may be more effective in determining future mean-reversion or direction.

That said, the approach is parsimonious and seems to have performed quite well in the empirical tests.

Pattern / Sequence Learning Approaches
There have been a number of papers describing approaches where the choice of weightings for the portfolio in the next period is determined based on finding prior patterns that match the current local pattern in past data.   I.e. the prior K returns are converted to a series of symbols and compared to historical sequences.   One attempts to locate the approximate sequence one or more more times in past history and observe the optimal portfolio weights that maximized cumulative return in the past.    This is repeated with varying length and discretization “fuzziness”.

The observed weights are blended based on the performance of the past weightings and degree of match with the current sequence.    The authors point to excellent results on empirical data.   It is surprising that there is a reasonable amount of information in the daily returns, I had thought would be more dominated by noise.

One Final Note
Although I have been “disparaging”  CRP, it can be shown that some weighting of CRP will yield the most optimal portfolio provided returns are i.i.d.   There lies the rub.

Leave a comment

Filed under machine-learning, portfolio

Advances in Computing

Have been watching advances in computing power for some time, since university really.   I did research in parallel algorithms and architecture for my first position at the university and later applied practically on Wall Street.   In those days super-expensive machines like the Intel Hypercube, Paragon, and many other architectures were the backbone of the HPC community.

HPC (High Performance Computing) roughly breaks down into 4 categories:

  1. Big iron supercomputers (MIMD generally)
  2. Distributed computing (these days advertised as Cloud Computing)
  3. The emerging SIMD GPU based solutions
  4. Quantum Computing (not really here yet for the mainstream)

In the machine learning and optimisation world there are massive problems, some of which are not computable on von-neumann architectures, as their runtime would be astronomical.    An (absurd) example of such a problem would be to simulate a large number monkeys typing on typewriters, stopping when one produces the works of Shakespeare.   The number of monkeys required to produce such a work on average in astronomical.     This seems like an absurd problem, but is comparable to the GP / GA approach.

Then of course there are numerous problems with high dimensionality and/or with polynomial order complexity.

Supercomputing on the Cheap
The FASTRA team at the University of Antwerp has put together an inexpensive multi-teraflop machine with 7 gaming cards.  Check out their video.

Unfortunately the “easy” part of these sort of solutions is the hardware.  The problem is the (often) great expense to develop one’s models in a SIMD framework, so can be applied to for the GPU architecture.    Although there is now standardization on the low-level C-variant used to program GPUs, there are significant differences between different models of GPUs, that even if you manage to write a correct SIMD program, may have to rearrange for a specific GPU implementation.   (I guess this is not all that different from my experiences with big-iron parallel architectures of the past).

One could have a team devoted to parallelization, tuning, and retuning / reworking for the new GPUs that are out periodically.   Very time consuming!

For my work, the problems that would map well are particle filters and monte-carlo based models, each of which have obvious fine-grained parallel operations.

Quantum Computing
The other notable announcement this week was Google’s use of quantum computing to solve pattern recognition problems.   I have not done the leg-work to fully understand the algorithms in quantum computing, but broadly it seems to be a matter of framing one’s problems statistically as path integration problems (i.e., expectations), where quantum computing allows the paths to be explored simultaneously.


5 Comments

Filed under HPC, machine-learning

Learning a Sequence

I had been looking at predicting durations (or the intensity) to model price behavior and variance estimation.    As mentioned previously, the prevalent ACD models in the literature do poorly.   Before moving on to another topic wanted to revisit this, with an idea for future approach.

Here is a sample of durations for a high-frequency price series:

9.30, 0.26, 0.28, 4.21, 0.04, 0.21, 3.23, 0.04, 2.28, ...

I decided that rather than trying to regress for specific durations, where there are an infinite number of possible values (theoretically), transform this into a set of symbols so that there are a finite number of states say:

S1, S2, S3, ...

where S1 might represent durations in [0, 0.25], S8 durations in [3, 3.5], etc.   The sequence of states for the above durations might look like:

12 → 1 → 1 → 9 → 1 → 1 → 8 → 1 → 7 → 6 → 6 → 6 ...

This turned out to be useful.

SVM
SVM on a radial basis kernel did a much better job of predicting the next symbol (duration) in a sequence than the ACD models.   It was still not  a suitable level of prediction however.

The problem with SVM and related approaches in general is that you either need to have a problem that can easily be categorized in high dimensional linear vector space.  A big part of this is finding the kernel that will map your (usually) non-linear vectors into a linearly separable space.    Also, SVM is arguably better suited to binary classification as opposed to multinary classification.

ANNs
In theory, an ANN with enough neurons can asymptotically approximate any function.  There are many problems in arriving at a general solution though:

  1. Calibration
    Standard techniques of backpropagation (essentially gradient descent) solve for a local optimum, which depends on the starting configuration.   A global optimum can be found with meta-heuristic approaches such as GAs, however, at significant computational cost.
  2. Overfitting
    It is very difficult to come up with networks that generalize.   Part of the success in doing this involves choosing training sets and configurations carefully.

Nevertheless, this may be an approach worth exploring.

Probabalistic Graph Models
As our duration pattern is essentially a transition from one state to the next, modeling as a probabalistic finite state machine appeals as model.  The idea with such an approach would be:

  1. empirically observe all chains of length ≤ some maximum
  2. determine the frequency of chains
  3. factorize into the smallest graph that reproduces those chains within some error

The chains, for instance:

A first approach to this problem is to consider whether can be modeled as a markovian state system.  It is, however, doubtful that the states {S1, S2, S3, …} can be modeled in a strictly markovian setting without the use of additional states.

For instance, is  P(S1|S2) the same as P(S1|S2, {prior states})?   The duration data shows dependence beyond the immediate prior state.    Therefore we have to expect that P(S1|S2, {S5,S1}) will differ from P(S1|S2, {S2,S3}), whereas in a markovian model, the probability of S1 can be conditioned purely on the prior state.

Such a markovian system might look like:

The HMM (Hidden Markov Model) combats this assumption by assuming that there is a hidden markovian process (usually with more states than the observed state system).   One can easily prove that a HMM of infinite size can exactly model all possible state chains (sequences) amongst a finite set of states.   Of course we are interested in a much smaller model that can reproduce most of the observed chains with limited error.

Here is a sample structure, where the black lines are edges between hidden states and the red edges indicate correspondence between hidden state and observed state.   The red edges are not traversed:

Aliasing Issues
Remember that we have arbitrarily subdivided durations (which are continuous) into N discrete states.   The idea was that the difference between say 0.25 seconds and 0.22 seconds is not important for our purposes.   One would think that less granular states will allow for  easier modeling of the state sequence.

The problem is that we are dividing these discretely.   We run into an aliasing problem where a specific duration partially belongs to the set represented by S(i) and S(i+1).   For instance for a sequence of length 3 we have 4 possible true state paths, each with associated probability.   Without compensating for aliasing we see the states (naively):

With aliasing we have the following possibilities:

As our path length approaches N, we will have 2^(N-1) possible paths.  One possible implementation of this is train with the M highest probability paths.

Fuzzy HMM
Aliasing is a kind of fuzzy set membership.   Aside from aliasing there are a number of reasons why we should consider fuzzy state membership:

  1. The data may be noisy, obscuring the pattern
  2. Discretisation error (aliasing)

Not surprisingly, other people have thought of fuzzy state membership in the context of HMM.   There are multiple fuzzy HMM models.   To be investigated …

Leave a comment

Filed under machine-learning, volatility

Mean in the context of Mean-Reversion

I want a running mean estimator that acts as a mode through mean reversion cycles of target amplitude or frequency.   The key characteristics should be:

  • adaptation to local volatility
    • determination of diffusion related squared return
    • determination of jump related squared return
    • determination as to how much of the jump should be absorbed into the mean
  • model of mean reversion
    • calibrated to a desired long-run rate of reversion
    • allowance for changes in reversion constant and reversion to long run
  • model of mean
    • autoregressive
    • innovations scaled by sigma term (with MR component and jumps removed)
  • recursive backward estimation of ML
    • implicitly decide how innovation is distributed amongst mean, mean-reversion, and noise

A SDE-based Approach
The model is an expanded variant of the familiar OrnsteinUhlenbeck process, with specialized mean-reversion, mean, and volatility processes.   It also attempts to correct for jumps.    Let’s start with the following SDEs (in continuous time):

Variance
There are many approaches to modeling volatility (all with issues).   Initially I had though to use a predictive model based on:

  • intensity process (based on “first exit” duration)
    This is a very complex process.  First approximations have been to use ACD, a family of AR models for duration.   ACD models perform very poorly on HF data however.    It seems that a markov chain model recognizing the patterns will be most appropriate.
  • amplitude process
    The amplitudes of squared returns seem to follow a largely AR process.   This seems fairly well behaved.

Before fully committing to a complex volatility model thought its makes sense to first try with a non-predictive measure of realized variance.  I will use:

The choice of α determines the degree of smoothing with previous values based on how local (and noisy) we want this function to be.   For example, here is the estimate with a smoothing factor of 60 and a threshold of 3e-5:

Discretising
Using Ito’s lemma we discretise the processes as follows:

Simplifying the volatility term in S(t), we first determine the variance of the SDE:

We reorganize as follows:

Putting it together
We can now model this discretely as a state-space based filter, searching for parameters that fit a-posteriori idealized view on the mode and mean-reversion process.   Post-parameterization, the process can be used in real-time to provide an estimate of the mode.

Final Notes
As you may have seen I took a (useful) 2-3 week diversion before coming back to the SDE based approach.   This is not a final model by any means, but I think a a solid starting point.    The purpose of the above is as a one of a number of factors in a multi-factor  strategy that want to optimize further.

Leave a comment

Filed under mean, state-space-models, statistics, stochatistic, volatility

Duration Estimation

In a prior post mentioned that for intra-day variance prediction it made sense to separate variance into 2 processes:

  1. intensity process
    When is the next event going to occur;  lets call this Tprior + Δt.   This is the more complex process of the two to predict.
  2. power process
    What is the amplitude of the event at time Tnow + Δt.   The power or amplitude process seems to be fairly well behaved.   An ARMA style process seems like a likely candidate.

Towards this end, I have been exploring models for the intensity process.   Very often this is modeled in terms of duration.   Below is a summary of some results:

ACD Models
ACD processes make overreaching assumptions.  In particular ACD models assume a constant AR decay and innovation contribution across time.   Unfortunately this is not supported by empirical observations.   Here are some results for the best-fitting Wiebull ACD model on HF data:

The R^2 level of 0.0091 does not inspire confidence.

SVR Model
I used an iterative non-parametric machine learning approach (SVR) with a training set of 20 prior observations and a lagged series of the derivatives of the prior 20 durations as the input vector.   Training across the entire series, one gets an in-sample prediction R^2 of 0.9980.   Unfortunately, incremental out of sample does not fair as well:

Distribution of Durations
Here are 2 views on the distribution of durations:

Alternative Models
Some possibilities:

  1. markov chain (probabalistic state system)
    We model the patterns by categorizing the durations into K separate levels.   To train we observe the chain of states, say {K1, K8, K1,K1,K1,K4} and determine a graph describing the approximate event chains, factorizing and assigning probabilities.
  2. ANN
    Use a simple feed-forward network, trained with a GA or DE.   This is easy to implement but subject to a variety of problems such as overfitting.

As the ANN is easy to compose, will start there.

1 Comment

Filed under genetic programming, machine-learning, neural networks, statistics, volatility

Durations on Intraday Price Series

As mentioned in a previous post, I intend to model quadratic variation in terms of multiple pairings of intensity (duration) and return level processes.   At a minimum want a pairing for “non-jump” related returns and a pairing for “jump” related returns.

To do this it is necessary to partition returns into the categories based on threshold.   We may further want to disregard price movements below a certain level unless they cumulatively add up to a return with significance within a period.   Towards this end my duration measurement function uses a threshold to determine whether a return is to be considered as an event or not.  In pseudocode:

r ← {0} ∪ diff(log(series))
t ← times (series)
durations ← {}
for (i in 2:length(r))
{
    # determine cumulative return since last acceptance
    cumr ← <cummulative return since last event or max cum window>

    # determine whether qualifying event has occurred
    if (|cumr| ≥ threshold or |r[i]| ≥ threshold)
        durations ← durations ∪ {t[i] - <Tlastevent>}
}

For the diffusion portion of the process, in this 2 second sampled data set (EUR/USD low-liquidity period), a threshold of 3e-5 (equivalent of about 1/2 pip), seems to work well:

The jump portion of the process should be set so as to capture desired jump features and not much more, here I show with a threshold of 2e-4 (equivalent to about 3 pips):

Leave a comment

Filed under statistics, volatility

Anatomy of a Spike

Spikes manifest in intra-day markets frequently.   These are often short-lived and associated with buying/selling programs more often than change in fundamental factors, particularly in low-liquidity periods.   In evaluating duration and variance measures was trying to determine reasonable jump thresholds.

Below is a price series demonstrating a variety of spiking behavior:

Taking a look at the region around the jump at high frequency, we note that the jump did not occur with one trade rather with multiple within a short space of time:

From a duration perspective, if we want to capture the spike as one event of a given magnitude we either need to consider the cumulative return over a given window or sample with a longer period.   Here is the same with a 2 second sample:

Here is the 2 second sampling of the original range.   With the longer sample period, the spikes in return are more evident (compare to the first graph in the post):

 

 

Leave a comment

Filed under statistics, volatility

Rethinking Variance

Volatility (variance) estimation is probably one of the most difficult areas of research in finance.   As mentioned in previous posts, recent research has been largely focused on duration based models for variance estimation.

I’ve done some experimentation with duration-based models, but have yet to find a satisfactory approach, but feel I am now beginning to understand what a satisfactory model might look like.    I don’t have the luxury of time to make it a full study, but hope to come up with an approach, however heuristic, that meets my needs.

There is a view that volatility (quadratic variation) can be separated into at least 2 distinct components:

  • integrated variance (IV)
    This is the primary return variation generated by the diffusive process component in our price process
  • squared jump term
    Jumps due to news or other behavior, not modeled by the diffusive process.

Together these explain what we call quadratic variance (QV):

The fundamental that underlies both “jumps” and “local volatility” is the notion of expected squared returns over a given time interval (in the continuous case infinitesimal, in real markets, discretely).   A pdf for the probability of a “jump” of size J at time t can be constructed as:

Where f(I) is the indicator of the “jump” event pdf and f(Δr|I,F) is the conditional distribution of jump size Δr given the event.  Together provide a posterior representing the probability of a price movement of Δr.   Variance can then be estimated with either the continuous or discrete form:

Thinking about “jumps” versus diffusion based volatility (as in the local vol estimates σ^2), the processes of the 2 differ in terms of intensity.   However we can still estimate both with the above expectation.   Both jumps and local vol can be estimated with the same approach, only the parameterization of the probability distribution need differ for different price jump levels (Δt).

One of the “hangups” in my recent attempts to model variance was in trying to model the timing and size of jumps as one process.   Now, what if we split this out:

  1. model the duration (or intensity) of events in jump level range
    Use an exponential-ACD model calibrated for duration between events in the range.
  2. model the size of events within the range
    Consider an autoregressive self-exciting process.  The AR component will model the reversion to the mean.

I can model variance as a set of self-exciting intensity processes, one for each range of price (return) levels.   I can the use the specialization of the f(I) and  f(Δr|I,F) conditional distributions for each price range to compute expected squared returns:

The indicator process
The indicator (or intensity) process can be modeled with an ACD model such as the Burr-ACD (which has shown to fit well empirically).   This can be calibrated for 2 sets of durations, one for “local vol” and one for “jump vol”.   Potentially we could consider sub-dividing into more levels.

The ACD process as developed by Engle and Russell specifies the duration as a mixing process:

The conditional density of Di is is then:

The ultimate form is subject to the choice of hazard function.   The general form of the ACD update equation is reminiscent of GARCH:

Where g(D) is an analog of our hazard function for the distribution. In our case we will use the Burr distribution as proposed by Grammig and Maurer (2000).  The Burr distribution density, survival and hazard functions are given by:

We use the Burr density as the distribution ξ[i] for the indicator component ε[i] in Di = ψ[i]ε[i].  We choose a density for the distribution of ψ[i] as a gamma mixture (from Guirreri 2009):

And the distribution of the indicator process ε[i] as the Burr distribution from above.   To determine the probability of duration D[i] given ψ[i], as we have  ε[i] = D[i] / ψ[i]:

Addendum: having done more exploration on this, looks like the Stochastic Volatility Duration model proposed by Ghysels,  Gourieroux, Jasiak, and Joann (2004), would offer a much better prediction capability.

The level process
The jump level process can be relatively simple.   We observe autoregressive decay across returns, both for diffusion related and jump returns.   There are gaps in between return events in both scenarios, but our mixture with the indicator process will allow us to model this.

A relatively simple autoregressive process can be used to model the event size decay from a self-exciting jump level for each range.   Given that observations will be made at non-uniform times, an exponential time-dependent AR model is probably best.   The distribution will need to be estimated.

Conclusion
Model volatility with two processes:

  1. intensity process to represent periodicity / gapping behavior
  2. magnitude process representing self-exciting AR behavior

Addendum: discovered that others are thinking along the same lines.   There are a couple of papers circa 2008/2009 proposing models similar to what I have outlined.

1 Comment

Filed under strategies

Misadventures in Variance Estimation

I’ve been looking at a number of different volatility models for both daily and intra-day variance.   In the process have looked at GARCH(1,1) for daily and have developed a number of duration based variance (DV) models for intra-day.

Daily Returns
I had concluded early on that GARCH(1,1) generally fits daily data well but fails miserably with intra-day data.    Unfortunately, for some markets, GARCH performs miserably as well.

I was looking at Canadian 2Y CMT yields over the last 10 years.   There are significant spikes in return for any given year.    For example in 2003, on the raw daily returns, GARCH does poorly (squared returns: black, sigma squared: red):

As an experiment I used (very) lightly smoothed prices as a basis for GARCH testing.  The smoothing gave continuity to the price function and to the 1st derivatives (returns).    GARCH(1,1) calibrated on the same data set, but with minimal smoothing yields a much higher degree of fit (25 x):

Of course we are now projecting sigma^2 on a lightly smoothed price series and not the original series.   Note that the magnitude of returns has reduced as well, as the smoothing allows for more gradual transitions.

Let’s look at a smaller section of returns on the raw price series:

Visually there appears to be a autoregressive component in squared returns, but involves jumps followed by periods of near zero return.   Subsequent jumps appear to follow a near linear decay in magnitude.    You will note that the last set appears to be a poor fit, however, there is a cluster of low jumps, that when considered in intensity space would have a higher combined value (ie, consider the cumulative return over a small window).

This behavior would point to modeling this as a intensity function (equivalent to duration).

Intra-day Squared Returns
Autocorrelation plots of liquid periods and illiquid periods both show significant autocorrelation out to 10+ seconds, indicating that we should be looking at an AR process at some level with long memory:

Thus far the most successful model is a duration based hawkes process which is “infinitely” autoregressive with exponential decay.

Leave a comment

Filed under statistics, volatility

Calibrating Non-Convex Functions

Many of the calibrations I perform are on non-convex functions, functions with multiple local minima/maxima.    Solving a min(max)imization for this class of function precludes a wide range of algorithms based on gradient descent (such as BFGS).    There are a number of “metaheuristic” approaches to solving non-convex problems, such as: simulated annealing, genetic algorithms, differential evolution, particle swarm, and brute force grid search, amongst others.

For the last couple of years I have been using JGAP, a library for optimization via genetic algorithm.    For some time was under the impression that most GA libraries were similar in terms of core algorithm and just different at the API level.   I ran into some problems with the new variance model I am working on, where JGAP was not converging in the expected manner.

I recently began using a package in R for ad-hoc calibrations (genoud) and found that it converges quickly, whereas JGAP did very poorly in comparison.   DEoptim is a very similar package (a straight implementation of differential evolution). Differential Evolution  a cousin of GA, but with a different approach to population evolution.

Differential Evolution
Differential Evolution is very similar to standard GA approaches in that both involve working towards the optimum via a randomly generated population of “trial vectors” (θ) and a directed evolution of those vectors towards the optimum.   Differential evolution is implemented as follows:

generate Population = { θ1, θ2, θ3, ... } where θ in domain of function
determine θbest ← maxarg [θ : fitness(θ)] forall θ
repeat
   foreach θi in Population
      draw { θa, θb, θc } randomly from P (without duplication)
      θd ← θa + F(θb - θc)
      θnew ← crossover (θd, θi) 

      # adjust ith vector if fitness of generated superior
      if fitness(θnew) > fitness(θi) then
         θi ← θnew

      # adjust best vector if new optimum seen
      if fitness(θnew) > fitness(θbest) then
         θbest ← θnew
   end
until fitness(θbest) approaches target

One of the key innovations is that the variance of the innovations (mutations) is dependent on the dispersion of the vectors.  At the outset the dispersion will be large, and as the optimization progresses, the dispersion will reduce and hence mutations will be more focused.   The choice of the function F(), operating on the difference of vectors, has various effects on the speed of convergence and exploration of the domain.  I will not go into it here.

Improvements with Hybrid Approach
Provided one has a continuous function with derivatives within the neighborhood of a trial vector, the algorithm can be improved with the use of gradient descent (such as the BFGS algorithm) to locate the true optimum within a neighborhood.    Care must be taken not to optimize too early such that the diversity of trials is lost.

Some Tests
Here are some results on the two R packages mentioned above.  Ultimately I need to get an implementation into Java, but wanted to do a comparison with existing implementations first.   I ran optimizations on a particularly pathological gaussian mixture (taken from the examples in the genoud package):

Which looks like:

It is clear that gradient descent (ascent) solvers will gravitate to one of the local maxima with the above function.   Testing DEoptim with a population of 100 and similar parameters, both arrived within 1e-4 of the answer within 10 generations (without any BFGS optimization) and 5e-6 within 60 generations.

With BFGS optimization on, genoud arrived within 1e-10 within 2 generations!   Unfortunately genoud arrived at the wrong maximum with some other (more complex) mixture function, whereas the pure DE approach arrived at the proper one.    With appropriate parameterisation can be avoided, however.

Leave a comment

Filed under genetic algorithms