Spark Study Notes

Spark sql catalyst optimizer – how to turn a query into a execution plan

https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

4 Stages from AST input:

  • Analysis – figure out the type for each attribute (1000 lines of code)
  • Local optimization – apply optimization rule to logic plan (the tree) (800 lines of code)
  • Physical planning – generate multiple physical plan based on logic plan. Select the best physical plan based on a cost model (500 lines of code)
  • Code generation

 

Spark RDD transformations:

  • intersection(otherDataset)
    • Dataset A = (1, 2, 3), Daset B = (3, 4, 5),
    • A.intersection(B) = (3)
  • join(otherDataset, [numTasks])  – similar to inner join
    • Dataset A = ((1, a), (2, b)), Daset B = ((1, x),  (3, y)),
    • A.join(B) = ( (1,(a,x)) )
  • cogroup(otherDataset, [numTasks]) – similar to full out join, except that the result for the same key are organized as iteratable collection
    • Dataset A = ((1, a), (1, b),  (2, c)), Daset B = ((1, x),  (3, y)),
    • A.cogroup(B) = ( (1, [a, b], [x])), (2, [c], [])), (3, [], [y]) )
  • cartesian(otherDataset)
    • Dataset A = (1, 2), Daset B = (x, y),
    • A.cartesian(B) = ((1, x), (1, y), (2, x), (2, y))
  • coalesce(numPartitions) – repartition downward while avoiding full shuffle
    • Dataset A:
      • partition 1 = (1, 2, 3)
      • partition 2 = (4, 5, 6)
      • partition 3 = (7, 8, 9)
      • partition 4 = (10, 11, 12)
    • A.coalesce(2)
      • partition 1 = (1, 2, 3, 7, 8, 9)
      • partition 2 = (4, 5, 6, 10, 11, 12)
  • repartition(numPartitions) – full shuffle repartition

 

Advertisements
Spark Study Notes

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s