Automatic parallelism through macro dataflow in high-level array languages (English)

Ratnalikar, Pushkar / Chauhan, Arun

In: 2014 23rd International Conference on Parallel Architecture and Compilation (PACT) ; 489-490 ; 2014

ISBN:

978-1-4503-2809-8

Conference paper / Electronic Resource

Dataflow computation is a powerful paradigm for parallel computing that is especially attractive on modern machines with multiple avenues for parallelism. However, adopting this model has been challenging, as neither hardware- nor language-based approaches have been successful except in specialized contexts. We argue that general-purpose array languages, such as MATLAB, are good candidates for automatic translation to macro dataflow-style execution, where each array operation naturally maps to a macro dataflow operation and the model can be executed efficiently on contemporary multicore architectures. We support our argument with a fully automatic compilation technique that translates MATLAB programs to dynamic dataflow graphs capable of handling unbounded structured control flow. These graphs can be executed on multicore machines in an event-driven fashion with the help of a runtime system built on top of Intel's Threading Building Blocks (TBB). By letting each task itself be data parallel, we are able to leverage existing data-parallel libraries and utilize parallelism at multiple levels. Our experiments on a set of benchmarks show speedups of up to 18× over the original data-parallel code on a machine with two 16-core processors.
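A minimal Python sketch of the execution model described above, under stated assumptions: it is not the authors' MATLAB compiler or their TBB-based runtime, the graph is written by hand rather than compiled from MATLAB, and control flow is omitted entirely. Each whole-array statement becomes one node (`Node`, `run_dataflow`, and `elementwise` are hypothetical names introduced here). A node fires as soon as all of its inputs have been produced, and a thread pool stands in for the TBB task scheduler. Each node's operation is itself split into chunks on an inner pool, to mimic the two levels of parallelism (task and data) the abstract mentions. Python threads only illustrate the scheduling structure; under CPython's global interpreter lock this sketch should not be expected to show real speedups.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import threading


class Node:
    """One macro dataflow node: a whole-array operation plus the names of its inputs."""

    def __init__(self, name, fn, inputs):
        self.name, self.fn, self.inputs = name, fn, list(inputs)


def elementwise(op, chunks=4):
    """Hypothetical helper: an element-wise array op that is itself data parallel."""

    def run(x, y):
        n = len(x)
        step = max(1, -(-n // chunks))  # ceiling division
        bounds = [(lo, min(lo + step, n)) for lo in range(0, n, step)]

        def work(bound):
            lo, hi = bound
            return [op(a, b) for a, b in zip(x[lo:hi], y[lo:hi])]

        # An inner pool processes contiguous chunks; map() keeps chunk order.
        with ThreadPoolExecutor(max_workers=chunks) as inner:
            return [v for part in inner.map(work, bounds) for v in part]

    return run


def run_dataflow(nodes, sources, workers=4):
    """Fire each node once all of its inputs exist (assumes an acyclic graph)."""
    values = dict(sources)
    by_name = {n.name: n for n in nodes}
    consumers = defaultdict(list)  # value name -> nodes that read it
    missing = {}                   # node name -> inputs not yet produced
    for n in nodes:
        if any(i not in values and i not in by_name for i in n.inputs):
            raise ValueError(f"node {n.name!r} reads an undefined value")
        missing[n.name] = sum(1 for i in n.inputs if i not in values)
        for i in n.inputs:
            consumers[i].append(n.name)
    if not nodes:
        return values

    lock = threading.Lock()
    finished = threading.Event()
    left = [len(nodes)]  # nodes that have not completed yet
    errors = []

    def fire(name):
        node = by_name[name]
        try:
            with lock:
                args = [values[i] for i in node.inputs]
            result = node.fn(*args)
        except Exception as exc:  # record the failure and stop waiting
            errors.append(exc)
            finished.set()
            return
        ready = []
        with lock:
            values[name] = result
            left[0] -= 1
            for c in consumers[name]:
                missing[c] -= 1
                if missing[c] == 0:
                    ready.append(c)
            if left[0] == 0:
                finished.set()
        for c in ready:  # event-driven: a finished node enables its consumers
            pool.submit(fire, c)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for n in nodes:
            if missing[n.name] == 0:
                pool.submit(fire, n.name)
        finished.wait()
    if errors:
        raise errors[0]
    return values


add = elementwise(lambda a, b: a + b)
mul = elementwise(lambda a, b: a * b)
sub = elementwise(lambda a, b: a - b)

# MATLAB-like program:  A = B + C;  D = B .* C;  E = A - D;
graph = [
    Node("A", add, ["B", "C"]),  # independent of D, so both may run at once
    Node("D", mul, ["B", "C"]),
    Node("E", sub, ["A", "D"]),  # fires only after A and D are both available
]
out = run_dataflow(graph, {"B": [1, 2, 3, 4, 5], "C": [10, 20, 30, 40, 50]})
print(out["E"])  # [1, -18, -57, -116, -195]
```

Here `A` and `D` do not depend on each other, so both are ready at the start; `E` is submitted by whichever of them completes last. Supporting the unbounded structured control flow mentioned in the abstract (loops and conditionals inside the graph) would need additional node kinds and is not attempted in this sketch.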

Title:

Automatic parallelism through macro dataflow in high-level array languages
Contributors:

Ratnalikar, Pushkar (author) / Chauhan, Arun (author)
Published in:

2014 23rd International Conference on Parallel Architecture and Compilation (PACT) ; 489-490
Publisher:

IEEE

Publication date:

2014-08-01
Size:

303949 bytes
ISBN:

978-1-4503-2809-8
DOI:

https://doi.org/10.1145/2628071.2628131
Type of media:

Conference paper
Type of material:

Electronic Resource
Language:

English
Source:

IEEE

Table of contents conference proceedings

The tables of contents are generated automatically and are based on the data records of the individual contributions available in the index of the TIB portal. The display of the tables of contents may therefore be incomplete.

1: Keynote: Internet of mobile things: Challenges and opportunities
Nahrstedt, Klara et al. | 2014
3: Virtues and limitations of commodity hardware transactional memory
Diegues, Nuno / Romano, Paolo / Rodrigues, Luis et al. | 2014
15: Cooperative cache scrubbing
Sartor, Jennifer B. / Heirman, Wim / Blackburn, Stephen M. / Eeckhout, Lieven / McKinley, Kathryn S. et al. | 2014
27: KLA: A new algorithmic paradigm for parallel graph computations
Harshvardhan / Fidel, Adam / Amato, Nancy M. / Rauchwerger, Lawrence et al. | 2014
39: Tiling and optimizing time-iterated computations over periodic domains
Bondhugula, Uday / Bandishti, Vinayaka / Cohen, Albert / Potron, Guillain / Vasilache, Nicolas et al. | 2014
51: ATCache: Reducing DRAM cache latency via a small SRAM tag cache
Huang, Cheng-Chieh / Nagarajan, Vijay et al. | 2014
61: SpongeDirectory: Flexible sparse directories utilizing multi-level memristors
Zhang, Lunkai / Strukov, Dmitri / Saadeldeen, Hebatallah / Fan, Dongrui / Zhang, Mingzhe / Franklin, Diana et al. | 2014
75: EFetch: Optimizing instruction fetch for event-driven web applications
Chadha, Gaurav / Mahlke, Scott / Narayanasamy, Satish et al. | 2014
87: XStream: Cross-core spatial streaming based MLC prefetchers for parallel applications in CMPs
Panda, Biswabandan / Balachandran, Shankar et al. | 2014
99: What is the cost of weak determinism?
Segulja, Cedomir / Abdelrahman, Tarek S. et al. | 2014
113: ILP and TLP in shared memory applications: A limit study
Fatehi, Ehsan / Gratz, Paul V. et al. | 2014
127: Versatile and scalable parallel histogram construction
Jung, Wookeun / Park, Jongsoo / Lee, Jaejin et al. | 2014
139: Bitwise data parallelism in regular expression matching
Cameron, Robert D. / Shermer, Thomas C. / Shriraman, Arrvindh / Herdy, Kenneth S. / Lin, Dan / Hull, Benjamin R. / Lin, Meng et al. | 2014
151: Adaptive heterogeneous scheduling for integrated GPUs
Kaleem, Rashid / Barik, Rajkishore / Shpeisman, Tatiana / Hu, Chunling / Lewis, Brian T. / Pingali, Keshav et al. | 2014
163: Warp-aware trace scheduling for GPUs
Jablin, James A. / Jablin, Thomas B. / Mutlu, Onur / Herlihy, Maurice et al. | 2014
175: CAWS: Criticality-aware warp scheduling for GPGPU workloads
Lee, Shin-Ying / Wu, Carole-Jean et al. | 2014
187: Invyswell: A hybrid transactional memory for Haswell's restricted transactional memory
Calciu, Irina / Gottschlich, Justin / Shpeisman, Tatiana / Herlihy, Maurice / Pokam, Gilles et al. | 2014
201: Consolidated conflict detection for hardware transactional memory
Zhao, Lihang / Draper, Jeffrey et al. | 2014
213: DeSTM: Harnessing determinism in STMs for application development
Pande, Santosh / Gavrilovska, Ada / Ravichandran, Kaushik et al. | 2014
225: PATS: Pattern aware scheduling and power gating for GPGPUs
Xu, Qiumin / Annavaram, Murali et al. | 2014
237: Heterogeneous microarchitectures trump voltage scaling for low-power cores
Lukefahr, Andrew / Padmanabha, Shruti / Das, Reetuparna / Dreslinski, Ronald / Wenisch, Thomas F. / Mahlke, Scott et al. | 2014
251: RCS: Runtime resource and core scaling for power-constrained multi-core processors
Ghasemi, Hamid Reza / Kim, Nam Sung et al. | 2014
263: Realm: An event-based low-level runtime for distributed memory architectures
Aiken, Alex / Bauer, Michael / Treichler, Sean et al. | 2014
277: kMAF: Automatic kernel-level management of thread and data affinity
Diener, Matthias / Cruz, Eduardo H. M. / Navaux, Philippe O. A. / Busse, Anselm / Heis, Hans-Ulrich et al. | 2014
289: Shuffling: A framework for lock contention aware thread scheduling for multicore multiprocessor systems
Pusukuri, Kishore Kumar / Gupta, Rajiv / Bhuyan, Laxmi N. et al. | 2014
301: Keynote: Domain-specific models for innovation in analytics
Blainey, Bob et al. | 2014
303: OpenTuner: An extensible framework for program autotuning
Ansel, Jason / Kamil, Shoaib / Veeramachaneni, Kalyan / Ragan-Kelley, Jonathan / Bosboom, Jeffrey / O'Reilly, Una-May / Amarasinghe, Saman et al. | 2014
317: Velociraptor: An embedded compiler toolkit for numerical programs targeting CPUs and GPUs
Garg, Rahul / Hendren, Laurie et al. | 2014
331: Memory scheduling towards high-throughput cooperative heterogeneous computing
Wang, Hao / Singh, Ripudaman / Schulte, Michael J. / Kim, Nam Sung et al. | 2014
343: Bounded memory scheduling of dynamic task graphs
Sbirlea, Dragos / Budimlic, Zoran / Sarkar, Vivek et al. | 2014
357: Trading cache hit rate for memory performance
Ding, Wei / Kandemir, Mahmut / Guttman, Diana / Jog, Adwait / Das, Chita R. / Yedlapalli, Praveen et al. | 2014
369: Compiler support for selective page migration in NUMA architectures
Piccoli, Guilherme / Santos, Henrique N. / Rodrigues, Raphael E. / Pousa, Christiane / Borin, Edson / Magno, Fernando et al. | 2014
381: COLORIS: A dynamic cache partitioning system using page coloring
Ye, Ying / West, Richard / Cheng, Zhuoqun / Li, Ye et al. | 2014
393: PEMOGEN: Automatic adaptive performance modeling during program runtime
Bhattacharyya, Arnamoy / Hoefler, Torsten et al. | 2014
405: ArrayTool: A lightweight profiler to guide array regrouping
Liu, Xu / Sharma, Kamal / Mellor-Crummey, John et al. | 2014
417: Design for scalability in enterprise SSDs
Tavakkol, Arash / Arjomand, Mohammad / Sarbazi-Azad, Hamid et al. | 2014
431: D²MA: Accelerating coarse-grained data transfer for GPUs
Jamshidi, D. Anoushe / Samadi, Mehrzad / Mahlke, Scott et al. | 2014
443: VAST: The illusion of a large memory space for GPUs
Lee, Janghaeng / Samadi, Mehrzad / Mahlke, Scott et al. | 2014
455: Automatic optimization of thread-coarsening for graphics processors
Magni, Alberto / Dubach, Christophe / O'Boyle, Michael et al. | 2014
467: Automatic execution of single-GPU computations across multiple GPUs
Cabezas, Javier / Vilanova, Lluis / Geladeno, Isaac / Jablin, Thomas B. / Navarro, Nacho / Hwu, Wen-mei et al. | 2014
469: LCA: A memory link and cache-aware co-scheduling approach for CMPs
Haritatos, Alexandros-Herodotos / Goumas, Georgios / Anastopoulos, Nikos / Nikas, Konstantinos / Kourtis, Kornilios / Koziris, Nectarios et al. | 2014
471: A run-time power manager exploiting software parallelism
Holmbacka, Simon / Lafond, Sebastien / Lilius, Johan et al. | 2014
473: Graph-based performance accounting for chip multiprocessor memory systems
Jahre, Magnus et al. | 2014
475: SQRL: Hardware accelerator for collecting software data structures
Kumar, Snehasish / Shriraman, Arrvindh / Srinivasan, Vijayalakshmi / Lin, Dan / Phillips, Jordon et al. | 2014
477: Optimizing stencil code via locality of computation
Luo, Yulong / Tan, Guangming et al. | 2014
479: ADHA: Automatic data layout framework for heterogeneous architectures
Majeti, Deepak / Meel, Kuldeep S. / Barik, Rajkishore / Sarkar, Vivek et al. | 2014
481: Active learning accelerated automatic heuristic construction for parallel program mapping
Ogilvie, William F. / Petoumenos, Pavlos / Wang, Zheng / Leather, Hugh et al. | 2014
483: Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels
Pai, Sreepathi / Govindarajan, R. / Thazhuthaveetil, Matthew J. et al. | 2014
485: Using STT-RAM to enable energy-efficient near-threshold chip multiprocessors
Pan, Xiang / Teodorescu, Radu et al. | 2014
487: Protection and utilization in shared cache through rationing
Parihar, Raj / Brock, Jacob / Ding, Chen / Huang, Michael C. et al. | 2014
489: Automatic parallelism through macro dataflow in high-level array languages
Ratnalikar, Pushkar / Chauhan, Arun et al. | 2014
491: A runtime support mechanism for fast mode switching of a self-morphing core for power efficiency
Srinivasan, Sudarshan / Kurella, Nithesh / Koren, Israel / Kundu, Sandip / Rodrigues, Rance et al. | 2014
493: Rollback-free value prediction with approximate loads
Thwaites, Bradley / Pekhimenko, Gennady / Esmaeilzadeh, Hadi / Yazdanbakhsh, Amir / Park, Jongse / Mururu, Girish / Mutlu, Onur / Mowry, Todd et al. | 2014
495: Measuring flexibility in single-ISA heterogeneous processors
Tomusk, Erik / Dubach, Christophe / O'Boyle, Michael et al. | 2014
497: SM-centric transformation: Circumventing hardware restrictions for flexible GPU scheduling
Wu, Bo / Chen, Guoyang / Li, Dong / Shen, Xipeng / Vetter, Jeffrey S. et al. | 2014
499: An event-based language for dynamic binary translation frameworks
Makarov, Serguei / Brown, Angela Demke / Goel, Ashvin et al. | 2014
501: Improving performance of streaming applications with filtering and control messages
Li, Peng / Buhler, Jeremy et al. | 2014
503: Stratified sampling for even workload partitioning
Paudel, Jeeva / Amaral, Jose Nelson et al. | 2014
505: Design of a hybrid MPI-CUDA benchmark suite for CPU-GPU clusters
Agarwal, Tejaswi / Becchi, Michela et al. | 2014
507: Data remapping for an energy efficient burst chop in DRAM memory systems
Jagathrakshakan, Sudharsan / Tavva, Venkata Kalyan / Mutyam, Madhu et al. | 2014
509: Data-reuse optimizations for pipelined tiling with parametric tile sizes
Isoard, Alexandre et al. | 2014
511: From petascale to the pocket: Adaptively scaling parallel programs for mobile SoCs
Fidel, Adam / Amato, Nancy M. / Rauchwerger, Lawrence et al. | 2014
513: Coarrays in GNU Fortran
Fanfarillo, Alessandro / Burnus, Tobias / Cardellini, Valeria / Filippone, Salvatore / Nagle, Dan / Rouson, Damian et al. | 2014
515: Locality-aware memory association for multi-target worksharing in OpenMP
Scogland, Thomas R. W. / Feng, Wu-Chun et al. | 2014
517: Processing big data graphs on memory-restricted systems
Harshvardhan / Amato, Nancy M. / Rauchwerger, Lawrence et al. | 2014
519: Author index
i: Front matters
