7 results for "Kaya, Kamer"
Search Results
2. Parallel algorithms for computing sparse matrix permanents.
- Author
- KAYA, Kamer
- Subjects
- SPARSE matrices, PARALLEL algorithms, PARALLEL programming
- Abstract
The permanent is an important characteristic of a matrix and it has been used in many applications. Unfortunately, it is #P-hard to compute and hard to approximate. For dense/full matrices, the fastest exact algorithm, Ryser's, has O(2^(n-1) n) complexity. In this work, a parallel algorithm, SkipPer, is proposed to exploit the sparsity within the input matrix as much as possible. SkipPer restructures the matrix to reduce the overall work, skips unnecessary steps, and employs a coarse-grain, shared-memory parallelization with dynamic scheduling. The experiments show that SkipPer increases the performance of exact permanent computation by up to 140x compared to the naive version for general matrices. Furthermore, thanks to the coarse-grain parallelization, a 14-15x speedup on average is obtained with 16 threads over sequential SkipPer. Overall, by exploiting sparsity and parallelism, the speedup reaches 2000x for some of the matrices used in the experimental setting. The proposed techniques can significantly improve the performance of exact permanent computation by simply replacing Ryser with SkipPer, especially when the matrix is highly sparse. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
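As background for the abstract above: Ryser's formula computes the permanent of a dense n x n matrix by inclusion-exclusion over column subsets. A minimal Python sketch of the plain O(2^n n^2) form (the O(2^(n-1) n) Gray-code variant and SkipPer's sparsity skipping and parallelism are not reproduced here):

```python
from itertools import combinations

def ryser_permanent(A):
    """Permanent of an n x n matrix via Ryser's inclusion-exclusion:
    perm(A) = (-1)^n * sum over nonempty column subsets S of
              (-1)^|S| * prod_i (sum_{j in S} A[i][j])."""
    n = len(A)
    total = 0
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            row_sum_product = 1
            for i in range(n):
                row_sum_product *= sum(A[i][j] for j in S)
            total += (-1) ** k * row_sum_product
    return (-1) ** n * total
```

For a 2x2 matrix this reduces to a11*a22 + a12*a21; the exponential subset loop is what SkipPer-style restructuring and skipping aim to prune for sparse inputs.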
3. MULTILEVEL ALGORITHMS FOR ACYCLIC PARTITIONING OF DIRECTED ACYCLIC GRAPHS.
- Author
- HERRMANN, JULIEN, ÖZKAYA, M. YUSUF, UÇAR, BORA, KAYA, KAMER, and ÇATALYÜREK, ÜMIT V.
- Subjects
- DIRECTED acyclic graphs, PARALLEL algorithms, UNDIRECTED graphs, PHASE partition, DIRECTED graphs, PARTITION functions, GEOMETRIC vertices
- Abstract
We investigate the problem of partitioning the vertices of a directed acyclic graph into a given number of parts. The objective function is to minimize the number or the total weight of the edges having end points in different parts, which is also known as the edge cut. The standard load balancing constraint of having an equitable partition of the vertices among the parts should be met. Furthermore, the partition is required to be acyclic; i.e., the interpart edges between the vertices from different parts should preserve an acyclic dependency structure among the parts. In this work, we adopt the multilevel approach with coarsening, initial partitioning, and refinement phases for acyclic partitioning of directed acyclic graphs. We focus on two-way partitioning (sometimes called bisection), as this scheme can be used in a recursive way for multiway partitioning. To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics. The quality of the computed acyclic partitions is assessed by computing the edge cut. We also propose effective ways to use the standard undirected graph partitioning methods in our multilevel scheme. We perform a large set of experiments on a dataset consisting of (i) graphs coming from an application and (ii) some others corresponding to matrices from a public collection. We report significant improvements compared to the current state of the art. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
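The acyclicity constraint described in the abstract above can be checked directly: contract each part to a single node and test whether the resulting quotient graph is itself a DAG. A small illustrative sketch in Python (the edge-list and part-map representation is an assumption for illustration, not taken from the paper):

```python
from collections import defaultdict, deque

def partition_is_acyclic(edges, part):
    """Check the paper's acyclicity constraint: the quotient graph,
    whose nodes are the parts and whose arcs are the inter-part arcs,
    must be a DAG. `edges` lists (u, v) arcs of the input DAG and
    `part` maps each vertex to its part id."""
    succ = defaultdict(set)
    for u, v in edges:
        if part[u] != part[v]:
            succ[part[u]].add(part[v])
    parts = set(part.values())
    indeg = {p: 0 for p in parts}
    for p in succ:
        for q in succ[p]:
            indeg[q] += 1
    # Kahn's algorithm: the quotient is acyclic iff every part gets popped.
    queue = deque(p for p in parts if indeg[p] == 0)
    seen = 0
    while queue:
        p = queue.popleft()
        seen += 1
        for q in succ[p]:
            indeg[q] -= 1
            if indeg[q] == 0:
                queue.append(q)
    return seen == len(parts)
```

For a two-way partition (the bisection case the paper focuses on) this degenerates to checking that inter-part arcs do not run in both directions.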
4. A New Method for Computational Private Information Retrieval.
- Author
- TİLLEM, GAMZE, SAVAŞ, ERKAY, and KAYA, KAMER
- Subjects
- INFORMATION retrieval, PUBLIC key cryptography, NUMBER theory, PARALLEL algorithms, EXPONENTIATION, CRYPTOSYSTEMS
- Abstract
Lipmaa's Computational Private Information Retrieval (CPIR) protocol is probably the most bandwidth-efficient method in the literature, although its computational complexity is a limiting factor for practical applications, as it is based on expensive public-key operations. Utilizing binary decision diagrams (BDDs) and the Damgård–Jurik cryptosystem, Lipmaa's CPIR performs three modular exponentiation operations per internal node of the BDD. In this paper, we present a new CPIR protocol, which reduces the number of exponentiation operations to one per first-level internal node and two per other internal node of the BDD. For 1024-bit exponents (i.e., an 80-bit security level) and 32,768 items, when compared with the fastest parallel implementation in the literature on four cores, reducing the number of exponentiations yields a 1.22x speedup and the multi-exponentiation technique adds 2.23x more on top of that. Overall, when combined, reducing the number of exponentiations, multi-exponentiation, parallelization on four cores, and the hybrid approach can provide more than a 300x speedup compared to the sequential implementation of the original method. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
5. A generic Private Information Retrieval scheme with parallel multi-exponentiations on multicore processors.
- Author
- Topcuoğlu, Cem, Kaya, Kamer, and Savaş, Erkay
- Subjects
- MULTICORE processors, INFORMATION retrieval, INFORMATION services, COMPUTER architecture, BANDWIDTHS
- Abstract
Summary: Private Information Retrieval (PIR) enables data owners to share and/or retrieve data on remote repositories without leaking any information about which data item is requested. Although it is always possible to download the entire dataset, this is clearly a waste of bandwidth. A fundamental approach in the literature for PIR is to exploit homomorphic cryptosystems. In these approaches, not one but many modular exponentiations need to be computed and multiplied to obtain the desired result. This multi-exponentiation operation can be implemented by exponentiating the bases to their corresponding exponents one by one. However, when the operation is considered as a whole, it can be performed in a more efficient way. Although individual exponentiations are pleasingly parallelizable, the combined multi-exponentiation requires a careful parallel implementation. In this work, we propose a generic tensor-based PIR scheme and novel, efficient techniques to parallelize multi-exponentiations on multicore processors with perfect load balance. The experimental results show that our load-balancing techniques make a parallel multi-exponentiation up to 27% faster when the sizes of the bases and the exponents are 4096 bits and the number of threads is 16. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
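The "considered as a whole" efficiency gain mentioned in the abstract above is the classic simultaneous (Straus/Shamir) multi-exponentiation idea: share one squaring chain across all bases instead of running a full square-and-multiply pass per base. A minimal sequential Python sketch (the paper's parallel, load-balanced version is not reproduced here):

```python
def multi_exp(bases, exps, mod):
    """Compute prod(b_i ** e_i) mod `mod` with a single shared
    square-and-multiply pass over the exponent bits (Straus/Shamir
    simultaneous multi-exponentiation). This textbook form omits the
    precomputed product table a tuned implementation would use."""
    nbits = max(e.bit_length() for e in exps)
    acc = 1
    for bit in reversed(range(nbits)):
        acc = acc * acc % mod           # one squaring shared by all bases
        for b, e in zip(bases, exps):
            if (e >> bit) & 1:
                acc = acc * b % mod     # multiply in bases whose bit is set
    return acc
```

Compared with k independent exponentiations, the shared pass performs only one squaring per bit position regardless of k, which is where the sequential savings (and the load-balancing challenge for a parallel version) come from.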
6. Multicore and manycore parallelization of cheap synchronizing sequence heuristics.
- Author
- Karahoda, Sertaç, Erenay, Osman Tufan, Kaya, Kamer, Türker, Uraz Cengiz, and Yenigün, Hüsnü
- Subjects
- MULTICORE processors, GRAPHICS processing units, FINITE state machines, HEURISTIC, PARALLEL processing, MACHINE theory
- Abstract
An important concept in finite state machine based testing is synchronization, which is used to initialize an implementation to a particular state. Usually, synchronizing sequences are used for this purpose, and the length of the sequence used is important since it determines the cost of the initialization process. Unfortunately, the shortest synchronizing sequence problem is NP-hard. Instead, heuristics are used to generate short sequences. However, the cubic complexity of even the fastest heuristic algorithms can be a problem in practice. In order to scale the performance of the heuristics for generating short synchronizing sequences, we propose algorithmic improvements together with a parallel implementation of the cheapest heuristics in the literature. To identify the bottlenecks of these heuristics, we experimented on random and slowly synchronizing automata. The identified bottlenecks are addressed with algorithmic modifications. We also implement the techniques on multicore CPUs and Graphics Processing Units (GPUs) to take advantage of modern parallel computing architectures. The sequential implementations of the heuristic algorithms are compared to our parallel implementations using a test suite of 1200 automata. The speedup values obtained depend on the size and the nature of the automaton. In our experiments, we observe speedups as high as 340x with a 16-core CPU parallelization and 496x with a GPU. Furthermore, the proposed methods scale well, and the speedup values increase as the size of the automata increases.
• We propose algorithmic improvements together with an efficient parallelization approach for synchronizing sequence heuristics.
• We experimented on random and slowly synchronizing automata.
• We implement the proposed techniques on multicore CPUs and Graphics Processing Units (GPUs) to take advantage of modern parallel architectures.
• With the proposed techniques, we observe speedups as high as 340x with a 16-core CPU parallelization and 496x with a GPU.
• The proposed methods scale well, and the speedup values increase as the size of the automata increases. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
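For context, the cheap synchronizing-sequence heuristics discussed above build on a simple greedy baseline: repeatedly find, by BFS over the pair automaton, a shortest word merging two of the still-active states, and append it. A compact Python sketch of that textbook greedy scheme (not the paper's optimized implementation; `delta[s][a]` is an assumed transition-table representation):

```python
from collections import deque

def greedy_sync_word(n, delta):
    """Greedy synchronizing-sequence construction: while more than one
    state is active, pick a pair of active states, BFS over unordered
    state pairs for a shortest merging word, and apply it to all active
    states. Returns a list of letters, or None if the automaton is not
    synchronizing. `delta[s][a]` is the successor of state s on letter a."""
    alphabet = range(len(delta[0]))
    active = set(range(n))
    word = []
    while len(active) > 1:
        s, t = sorted(active)[:2]          # any pair of distinct active states
        start = (min(s, t), max(s, t))
        parent = {start: None}
        queue = deque([start])
        goal = None
        while queue:
            p, q = queue.popleft()
            if p == q:                     # the pair has been merged
                goal = (p, q)
                break
            for a in alphabet:
                np_, nq = delta[p][a], delta[q][a]
                nxt = (min(np_, nq), max(np_, nq))
                if nxt not in parent:
                    parent[nxt] = ((p, q), a)
                    queue.append(nxt)
        if goal is None:
            return None                    # unmergeable pair: not synchronizing
        w = []                             # recover the merging word
        node = goal
        while parent[node] is not None:
            node, a = parent[node]
            w.append(a)
        w.reverse()
        for a in w:                        # apply it to every active state
            active = {delta[x][a] for x in active}
        word.extend(w)
    return word
```

On the classic Černý automata this baseline already produces valid synchronizing words; the per-pair BFS is the kind of repeated work the paper's algorithmic improvements and CPU/GPU parallelizations target.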
7. Boosting expensive synchronizing heuristics.
- Author
- Saraç, N. Ege, Altun, Ömer Faruk, Atam, Kamil Tolga, Karahoda, Sertaç, Kaya, Kamer, and Yenigün, Hüsnü
- Subjects
- GRAPHICS processing units, HEURISTIC, MACHINE theory, PARALLEL algorithms, ROBOTS
- Abstract
Synchronization, the problem of bringing an automaton to a particular state regardless of its initial state, is an important concept for automata. It has several applications in practice and is related to a fifty-year-old conjecture on the length of the shortest synchronizing word. Although using shorter words increases the effectiveness in practice, finding a shortest one (which is not necessarily unique) is NP-hard. For this reason, various heuristics exist in the literature. However, high-quality heuristics such as SynchroP, which produce relatively shorter sequences, are very expensive and can take hours when the automaton has tens of thousands of states. The SynchroP heuristic has been frequently used as a benchmark to evaluate the performance of new heuristics. In this work, we first improve the runtime of SynchroP and its variants by using algorithmic techniques. We then focus on adapting SynchroP to many-core architectures; overall, we obtain more than a 1000x speedup on GPUs compared to the naive sequential implementation that has frequently been used as a benchmark in the literature. We also propose two SynchroP variants and evaluate their performance. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
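The expensive ingredient of SynchroP-style cost functions is a table of pairwise merging distances over the pair automaton; computing it is the step such heuristics spend most of their time on. It can be sketched as a backward multi-source BFS from all singleton pairs (a minimal Python illustration; the `delta[s][a]` representation is assumed, not taken from the paper):

```python
from collections import deque

def pair_distances(n, delta):
    """d[(p, q)] = length of a shortest word merging states p and q
    (p <= q), via backward multi-source BFS over the pair automaton
    seeded with all singleton pairs (s, s). `delta[s][a]` is the
    successor of state s on letter a."""
    alphabet = range(len(delta[0]))
    # Inverse transitions: inv[a][s] = states mapped to s by letter a.
    inv = [[[] for _ in range(n)] for _ in alphabet]
    for s in range(n):
        for a in alphabet:
            inv[a][delta[s][a]].append(s)
    dist = {}
    queue = deque()
    for s in range(n):
        dist[(s, s)] = 0
        queue.append((s, s))
    while queue:
        p, q = queue.popleft()
        for a in alphabet:
            for pp in inv[a][p]:
                for qq in inv[a][q]:
                    pair = (min(pp, qq), max(pp, qq))
                    if pair not in dist:
                        dist[pair] = dist[(p, q)] + 1
                        queue.append(pair)
    return dist
```

The automaton is synchronizing exactly when every unordered pair appears in the table; the quadratic number of pairs is what makes these heuristics cubic overall and a natural target for GPU parallelization.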
Discovery Service for Jio Institute Digital Library