Detailed Course Outline
1: Introduction to the parallel framework architecture
• Describe the parallel processing architecture
• Describe pipeline and partition parallelism
• Describe the role of the configuration file
• Design a job that creates robust test data
2: Compiling and executing jobs
• Describe the main parts of the configuration file
• Describe the compilation process and the OSH that it generates
• Describe the role and the main parts of the Score
• Describe the job execution process
3: Partitioning and collecting data
• Understand how partitioning works in the parallel framework
• View partitioners in the Score
• Select partitioning algorithms
• Generate sequences of numbers (surrogate keys) in a partitioned, parallel environment
4: Sorting data
• Sort data in the parallel framework
• Find inserted sorts in the Score
• Reduce the number of inserted sorts
• Optimize Fork-Join jobs
• Use Sort stages to determine the last row in a group
• Describe sort key and partitioner key logic in the parallel framework
5: Buffering in parallel jobs
• Describe how buffering works in parallel jobs
• Tune buffers in parallel jobs
• Avoid buffer contention
6: Parallel framework data types
• Describe virtual data sets
• Describe schemas
• Describe data type mappings and conversions
• Describe how external data is processed
• Handle nulls
• Work with complex data
7: Reusable components
• Create a schema file
• Read a sequential file using a schema
• Describe Runtime Column Propagation (RCP)
• Enable and disable RCP
• Create and use shared containers
8: Balanced Optimization
• Enable Balanced Optimization functionality in Designer
• Describe the Balanced Optimization workflow
• List the different Balanced Optimization options
• Push stage processing to a data source
• Push stage processing to a data target
• Optimize a job that accesses the Hadoop Distributed File System (HDFS)
• Understand the limitations of Balanced Optimization