@feynmanliang
Contributor

Use in-place updates, reduce the number of transposes, and vectorize operations in the OnlineLDA implementation.

SparkQA commented Jul 17, 2015

Test build #37550 has finished for PR 7454 at commit aead650.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute
    • abstract class Star extends LeafExpression with NamedExpression
    • case class UnresolvedAlias(child: Expression) extends UnaryExpression with NamedExpression
    • case class SortOrder(child: Expression, direction: SortDirection) extends UnaryExpression
    • trait AggregateExpression extends Expression
    • trait PartialAggregate extends AggregateExpression
    • case class Min(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Max(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Count(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Average(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Sum(child: Expression) extends UnaryExpression with PartialAggregate
    • case class SumDistinct(child: Expression) extends UnaryExpression with PartialAggregate
    • case class First(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Last(child: Expression) extends UnaryExpression with PartialAggregate
    • trait Generator extends Expression
    • case class Explode(child: Expression) extends UnaryExpression with Generator
    • trait NamedExpression extends Expression
    • abstract class Attribute extends LeafExpression with NamedExpression
    • case class PrettyAttribute(name: String) extends Attribute
    • abstract class LeafNode extends LogicalPlan
    • abstract class UnaryNode extends LogicalPlan

SparkQA commented Jul 17, 2015

Test build #37553 has finished for PR 7454 at commit c62cb1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class StandaloneRecoveryModeFactory(conf: SparkConf, serializer: Serializer)
    • class RFormula(override val uid: String)
    • case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute
    • abstract class Star extends LeafExpression with NamedExpression
    • case class UnresolvedAlias(child: Expression) extends UnaryExpression with NamedExpression
    • abstract class LeafExpression extends Expression
    • abstract class UnaryExpression extends Expression
    • abstract class BinaryExpression extends Expression
    • case class SortOrder(child: Expression, direction: SortDirection) extends UnaryExpression
    • trait AggregateExpression extends Expression
    • trait PartialAggregate extends AggregateExpression
    • case class Min(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Max(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Count(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Average(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Sum(child: Expression) extends UnaryExpression with PartialAggregate
    • case class SumDistinct(child: Expression) extends UnaryExpression with PartialAggregate
    • case class First(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Last(child: Expression) extends UnaryExpression with PartialAggregate
    • trait Generator extends Expression
    • case class Explode(child: Expression) extends UnaryExpression with Generator
    • trait NamedExpression extends Expression
    • abstract class Attribute extends LeafExpression with NamedExpression
    • case class PrettyAttribute(name: String) extends Attribute
    • case class Length(child: Expression) extends UnaryExpression with ExpectsInputTypes
    • case class FormatNumber(x: Expression, d: Expression)
    • abstract class LeafNode extends LogicalPlan
    • abstract class UnaryNode extends LogicalPlan
    • abstract class BinaryNode extends LogicalPlan

SparkQA commented Jul 17, 2015

Test build #37610 has finished for PR 7454 at commit 7f62a55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

I'll make a pass. Can you please make a JIRA for this and put it in the title?

Also, can you please test this to verify the speedups? It sounds like local tests could suffice, based on the changes.

Member

Should this be a val since you use Breeze to mutate the internals?

Contributor Author

Yep, fixed.
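
The point about `val` can be illustrated with a minimal plain-Scala sketch. The names `ValWithInPlaceUpdate` and `scaleInPlace` are hypothetical, and a plain `Array[Double]` stands in for the Breeze object under review: a `val` only fixes the reference, so in-place mutation of the contents still works.

```scala
// Hypothetical illustration; not code from the patch.
object ValWithInPlaceUpdate {
  // Multiplies every element in place; no new array is allocated.
  def scaleInPlace(xs: Array[Double], factor: Double): Unit = {
    for (i <- xs.indices) xs(i) *= factor
  }

  def main(args: Array[String]): Unit = {
    val gamma = Array(1.0, 2.0, 3.0) // val: the reference cannot be reassigned
    scaleInPlace(gamma, 2.0)         // the contents are still mutable
    println(gamma.mkString(", "))
  }
}
```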

@jkbradley
Member

(maybe not; see below first)
I think there's a bug. I tried running LDAExample, and it failed with an exception.

I ran:

bin/run-example mllib.LDAExample docs/*.md --maxIterations 2 --algorithm online --vocabSize 10 --k 3

and got the exception:

 ** On entry to DGEMV  parameter number  6 had an illegal value

15/07/20 22:20:57 WARN TaskSetManager: Lost task 8.0 in stage 12.0 (TID 395, localhost): java.lang.Error
    at org.j_paine.formatter.FormatParser.<init>(FormatParser.java:353)
    at org.j_paine.formatter.FormatParser.<init>(FormatParser.java:346)
    at org.j_paine.formatter.Parsers.<init>(Formatter.java:1748)
    at org.j_paine.formatter.Parsers.theParsers(Formatter.java:1739)
    at org.j_paine.formatter.Format.<init>(Formatter.java:177)
    at org.j_paine.formatter.Formatter.<init>(Formatter.java:30)
    at org.netlib.util.Util.f77write(Util.java:429)
    at org.netlib.err.Xerbla.xerbla(err.f)
    at org.netlib.blas.Dgemv.dgemv(blas.f)
    at com.github.fommil.netlib.F2jBLAS.dgemv(F2jBLAS.java:106)
    at breeze.linalg.operators.DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DVD_eq_DVD$.apply(DenseMatrixOps.scala:80)
    at breeze.linalg.operators.DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DVD_eq_DVD$.apply(DenseMatrixOps.scala:72)
    at breeze.linalg.ImmutableNumericOps$class.$times(NumericOps.scala:135)
    at breeze.linalg.DenseMatrix.$times(DenseMatrix.scala:53)
    at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$$anonfun$8$$anonfun$apply$4.apply(LDAOptimizer.scala:395)
    at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$$anonfun$8$$anonfun$apply$4.apply(LDAOptimizer.scala:380)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.util.random.GapSamplingReplacementIterator.foreach(RandomSampler.scala:271)
    at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$$anonfun$8.apply(LDAOptimizer.scala:380)
    at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$$anonfun$8.apply(LDAOptimizer.scala:378)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

 ** On entry to DGEMV  parameter number  6 had an illegal value
15/07/20 22:20:57 ERROR TaskSetManager: Task 8 in stage 12.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 12.0 failed 1 times, most recent failure: Lost task 8.0 in stage 12.0 (TID 395, localhost): java.lang.Error
    at org.j_paine.formatter.FormatParser.<init>(FormatParser.java:353)
    at org.j_paine.formatter.FormatParser.<init>(FormatParser.java:346)
    at org.j_paine.formatter.Parsers.<init>(Formatter.java:1748)
    at org.j_paine.formatter.Parsers.theParsers(Formatter.java:1739)
    at org.j_paine.formatter.Format.<init>(Formatter.java:177)
    at org.j_paine.formatter.Formatter.<init>(Formatter.java:30)
    at org.netlib.util.Util.f77write(Util.java:429)
    at org.netlib.err.Xerbla.xerbla(err.f)
    at org.netlib.blas.Dgemv.dgemv(blas.f)
    at com.github.fommil.netlib.F2jBLAS.dgemv(F2jBLAS.java:106)
    at breeze.linalg.operators.DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DVD_eq_DVD$.apply(DenseMatrixOps.scala:80)
    at breeze.linalg.operators.DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DVD_eq_DVD$.apply(DenseMatrixOps.scala:72)
    at breeze.linalg.ImmutableNumericOps$class.$times(NumericOps.scala:135)
    at breeze.linalg.DenseMatrix.$times(DenseMatrix.scala:53)
    at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$$anonfun$8$$anonfun$apply$4.apply(LDAOptimizer.scala:395)
    at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$$anonfun$8$$anonfun$apply$4.apply(LDAOptimizer.scala:380)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.util.random.GapSamplingReplacementIterator.foreach(RandomSampler.scala:271)
    at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$$anonfun$8.apply(LDAOptimizer.scala:380)
    at org.apache.spark.mllib.clustering.OnlineLDAOptimizer$$anonfun$8.apply(LDAOptimizer.scala:378)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1295)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1286)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1285)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1285)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:752)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:752)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:752)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1506)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1467)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1456)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

@jkbradley
Member

I'm wondering if it's a mis-matched shape issue.

@jkbradley
Member

Ohh, actually, it might be from me trying to print stats, which might be some weird Breeze object that does not implement toString properly. Let me retry.

@jkbradley
Member

Hm, no, I think something is wrong. Can you try running the example as I wrote above?

@feynmanliang
Contributor Author

I played with it this morning. The bug occurred because ids = List(); apparently Breeze calls dgemv with an invalid LDA argument (the leading dimension of the matrix, which is DGEMV's parameter number 6, not the topic model) when you row-index the matrix with an empty list.

Since ids = List() means stat(::, ids) has nothing to update, I wrapped the code in a conditional to fix the problem. We should probably investigate why some documents have zero terms, though.
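
The shape of that conditional can be sketched in plain Scala without Breeze. This is not the actual patch: `EmptyDocGuard`, `updateStat`, and `contribution` are hypothetical names, and nested arrays stand in for the Breeze matrices. It only shows the idea of skipping the slice-and-update when a document has no term ids.

```scala
// Hypothetical stand-in for the sufficient-statistics update; not the real LDAOptimizer code.
object EmptyDocGuard {
  // stat is a k x vocabSize matrix stored as rows; ids are the term indices of one document;
  // contribution is k x ids.length, one column per term in the document.
  def updateStat(
      stat: Array[Array[Double]],
      ids: List[Int],
      contribution: Array[Array[Double]]): Unit = {
    // Guard: an empty document contributes nothing, so skip the update entirely.
    if (ids.nonEmpty) {
      for (row <- stat.indices; (id, j) <- ids.zipWithIndex) {
        stat(row)(id) += contribution(row)(j)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val stat = Array.fill(3, 10)(0.0)
    updateStat(stat, List(), Array.fill(3, 0)(0.0))     // empty document: no-op
    updateStat(stat, List(2, 7), Array.fill(3, 2)(1.0)) // columns 2 and 7 are updated
    println(stat.map(_.sum).mkString(", "))
  }
}
```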

@jkbradley
Copy link
Member

Oh, I see. Thanks for investigating! In my example, the number of terms is limited to 10 (so I could print the topics), probably making some documents empty.

This LGTM pending tests, but can you please make a starter JIRA for adding a unit test which tests online LDA with empty documents? You may need to note in it that only SparseVectors can be empty. Thanks.
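
A rough plain-Scala sketch of how a small vocabulary can leave documents empty. It assumes the vocabulary is cut to the most frequent terms; `VocabTruncation` and `encode` are hypothetical names, and a `Map[Int, Int]` of term id to count stands in for a sparse vector.

```scala
// Hypothetical illustration; not the LDAExample preprocessing code.
object VocabTruncation {
  // Keeps only the vocabSize most frequent terms, then encodes each document
  // as a sparse map from term id to count. Terms outside the vocabulary are dropped.
  def encode(docs: Seq[Seq[String]], vocabSize: Int): Seq[Map[Int, Int]] = {
    val counts = docs.flatten.groupBy(identity).map { case (t, ts) => (t, ts.size) }
    val vocab = counts.toSeq
      .sortBy { case (t, c) => (-c, t) } // most frequent first, ties broken by term
      .take(vocabSize)
      .map(_._1)
      .zipWithIndex
      .toMap
    docs.map { doc =>
      doc.flatMap(t => vocab.get(t))
        .groupBy(identity)
        .map { case (id, xs) => (id, xs.size) }
    }
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("spark", "spark", "lda"), Seq("breeze"))
    // With vocabSize = 2, "breeze" is dropped, so the second document has no terms left.
    encode(docs, 2).foreach(d => println(d.size))
  }
}
```

A unit test along these lines would feed such a zero-entry sparse document to online LDA and check that training completes.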

@feynmanliang
Contributor Author

Ran some local perf tests.

bin/run-example mllib.LDAExample docs/*.md --maxIterations 100 --algorithm online --vocabSize 100 --k 3

Training Time (sec):

Before  After
5.737   4.870
5.510   4.671
5.718   4.689

bin/run-example mllib.LDAExample docs/*.md --maxIterations 100 --algorithm online --vocabSize 100 --k 20

Training Time (sec):

Before  After
7.020   6.311
6.977   6.564
7.485   6.088
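
To summarize the timings above, here is a small plain-Scala sketch that averages the three runs per configuration and takes the ratio of the means. `LdaTimings`, `mean`, and `speedup` are hypothetical helper names; only the numbers come from the tables.

```scala
// Hypothetical helper for summarizing the reported timings.
object LdaTimings {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Ratio of mean training time before the change to mean training time after it.
  def speedup(before: Seq[Double], after: Seq[Double]): Double = mean(before) / mean(after)

  val k3Before = Seq(5.737, 5.510, 5.718)
  val k3After = Seq(4.870, 4.671, 4.689)
  val k20Before = Seq(7.020, 6.977, 7.485)
  val k20After = Seq(6.311, 6.564, 6.088)

  def main(args: Array[String]): Unit = {
    println(f"k=3:  ${speedup(k3Before, k3After)}%.2fx")
    println(f"k=20: ${speedup(k20Before, k20After)}%.2fx")
  }
}
```

With these numbers the ratios come out to roughly 1.19x for k=3 and 1.13x for k=20. Three local runs per setting is a small sample, so these are indicative only.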

@feynmanliang changed the title from "[MLlib]OnlineLDA Performance Improvements" to "[SPARK-9224][MLlib]OnlineLDA Performance Improvements" on Jul 21, 2015

SparkQA commented Jul 21, 2015

Test build #37968 has finished for PR 7454 at commit 78b0f5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Merging with master. Thanks!

@asfgit closed this in 8486cd8 on Jul 22, 2015
@feynmanliang deleted the OnlineLDA-perf-improvements branch on July 22, 2015 23:08