An Analysis of Software Parallelism in Big Data Technologies for Data-Intensive Architectures
Data-intensive architectures handle enormous amounts of information, which requires the use of big data technologies. These tools include parallelization mechanisms employed to speed up data processing. However, the increasing volume of data has an impact on this parallelism and on resource usage. The strategy traditionally employed to increase processing power has been to add more resources in order to exploit parallelism; this strategy is, however, not always feasible in real projects, principally owing to the cost involved. The intention of this paper is, therefore, to analyze how this parallelism can be exploited from a software perspective, focusing specifically on whether big data tools behave as ideally expected: a linear increase in performance with respect to the degree of parallelism and the data load rate. We consequently analyze, on the one hand, the impact of the internal data partitioning mechanisms of big data tools and, on the other, the impact of an increasing data load on performance, while keeping the hardware resources constant. To this end, we have conducted an experiment with two consolidated big data tools: Kafka and Elasticsearch. Our goal is to analyze the performance obtained when varying the degree of parallelism and the data load rate without ever reaching the limit of the hardware resources available. The results of these experiments lead us to conclude that the performance obtained is far from the ideal speedup, but that software parallelism nevertheless has a significant impact.