Job Description:

I built an ETL pipeline to process terabytes of data. To achieve that goal, I set up a Spark cluster (Scala) and a MinIO server for object storage.
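For context, a Spark job usually reaches MinIO through the Hadoop S3A connector. The sketch below shows one way that wiring might look. The object name `MinioSession`, the endpoint, and the credentials are hypothetical placeholders, and it assumes the `hadoop-aws` module is on the classpath; the actual configuration used in this project is not shown in the post.

```scala
import org.apache.spark.sql.SparkSession

object MinioSession {
  /** Hadoop S3A options commonly needed to reach an S3-compatible store such as MinIO.
    * All argument values are supplied by the caller; nothing here comes from the post. */
  def s3aOptions(endpoint: String, accessKey: String, secretKey: String): Map[String, String] =
    Map(
      "spark.hadoop.fs.s3a.endpoint" -> endpoint,
      "spark.hadoop.fs.s3a.access.key" -> accessKey,
      "spark.hadoop.fs.s3a.secret.key" -> secretKey,
      // MinIO is typically addressed as host/bucket rather than bucket.host
      "spark.hadoop.fs.s3a.path.style.access" -> "true",
      // Enable TLS only when the endpoint URL asks for it
      "spark.hadoop.fs.s3a.connection.ssl.enabled" -> endpoint.startsWith("https").toString
    )

  /** Builds (or reuses) a session with the given options applied.
    * The master URL is expected to come from spark-submit. */
  def build(appName: String, options: Map[String, String]): SparkSession =
    options
      .foldLeft(SparkSession.builder().appName(appName)) { case (b, (k, v)) => b.config(k, v) }
      .getOrCreate()
}
```

A call such as `MinioSession.build("etl", MinioSession.s3aOptions("http://minio.example:9000", key, secret))` would then let `s3a://bucket/path` URIs resolve against that endpoint.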

I can process and save 200 gigabytes in roughly 30 minutes using 10 virtual machines for Spark processing.

The issue is that I am not able to scale that processing: if I double the number of Spark virtual machines, the processing time does not change.

I need a Data Architect who has enough expertise to help me identify the bottleneck and fix the issue.


• I use virtual machines set up on-premises using VMware ESXi 6

• Physical machines (which host VMs) are on a 1 GB network.

• There is no overcommitment of vCPU or RAM

• Spark VMs: 16 vCPU, 64 GB RAM

• MinIO (storage): 16 vCPU, 64 GB RAM, configured using RAID 0
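A quick back-of-envelope check can be run on the figures above. The sketch below is only an estimate under stated assumptions: that "1 GB network" means a 1 Gbit/s link, that the 200 gigabytes are decimal gigabytes crossing that link once, that every vCPU is given to Spark executors, and that shuffle stages use Spark's default of 200 partitions (`spark.sql.shuffle.partitions`). The object and function names are hypothetical.

```scala
object ScalingEstimate {
  /** Average rate in Mbit/s when `gigabytes` (decimal GB) are moved in `minutes`. */
  def throughputMbps(gigabytes: Double, minutes: Double): Double =
    gigabytes * 8000.0 / (minutes * 60.0)

  /** Fraction of cores that can be busy in a stage that has only `tasks` tasks. */
  def coreUtilisation(tasks: Int, vms: Int, coresPerVm: Int): Double =
    math.min(1.0, tasks.toDouble / (vms * coresPerVm))

  def main(args: Array[String]): Unit = {
    // 200 GB in 30 minutes, compared with an assumed 1000 Mbit/s link
    println(f"average rate: ${throughputMbps(200, 30)}%.0f Mbit/s of an assumed 1000 Mbit/s")
    // 200 shuffle tasks spread over 10 and then 20 VMs of 16 vCPU each
    println(f"10 VMs: ${coreUtilisation(200, 10, 16) * 100}%.1f%% of cores usable in a shuffle stage")
    println(f"20 VMs: ${coreUtilisation(200, 20, 16) * 100}%.1f%% of cores usable in a shuffle stage")
  }
}
```

Under those assumptions the average rate works out to roughly 889 Mbit/s, which is close to the capacity of a single 1 Gbit/s link, and a 200-task stage can keep at most 62.5% of 320 cores busy. Either effect would be consistent with extra Spark VMs not reducing the run time, but neither is confirmed by the information given here.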


The process is straightforward:

• Read data from 2 sources on MinIO,

• Union the data from the two sources,

• Filter out rows with empty values in one column of the resulting dataset,

• Apply 2 groupBy operations on that column (we save intermediate values after the first groupBy),

• Union the dataset obtained after the groupBy operations with the rows that had empty values in that column,

• Save the whole result back to MinIO
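The steps above could be sketched in Spark (Scala) roughly as follows. This is an illustration under assumptions, not the actual job: the column name `key`, the Parquet format, the path arguments, and the two aggregations (a row count followed by a sum of counts) are all hypothetical, and "empty" is taken to mean null or an empty string. `unionByName` with `allowMissingColumns` requires Spark 3.1 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, sum}

object EtlSketch {
  // Hypothetical column name; the post does not name the column.
  val KeyCol = "key"

  /** Unions both sources and splits them into (rows with a key, rows with an empty key). */
  def splitByKey(a: DataFrame, b: DataFrame): (DataFrame, DataFrame) = {
    val all = a.unionByName(b)
    val isEmpty = col(KeyCol).isNull || col(KeyCol) === ""
    (all.filter(!isEmpty), all.filter(isEmpty))
  }

  /** First groupBy: a placeholder aggregation that counts rows per key. */
  def firstAgg(nonEmpty: DataFrame): DataFrame =
    nonEmpty.groupBy(KeyCol).agg(count(lit(1)).as("n_rows"))

  /** Second groupBy on the same column: a placeholder aggregation that sums the counts. */
  def secondAgg(intermediate: DataFrame): DataFrame =
    intermediate.groupBy(KeyCol).agg(sum("n_rows").as("total_rows"))

  /** Full flow: read two sources, aggregate twice, re-attach empty-key rows, write out. */
  def run(spark: SparkSession, srcA: String, srcB: String, tmp: String, out: String): Unit = {
    val (nonEmpty, empty) = splitByKey(spark.read.parquet(srcA), spark.read.parquet(srcB))
    // Save the intermediate values after the first groupBy, then read them back
    firstAgg(nonEmpty).write.mode("overwrite").parquet(tmp)
    val result = secondAgg(spark.read.parquet(tmp))
    // Columns missing on either side are filled with nulls
    result.unionByName(empty, allowMissingColumns = true)
      .write.mode("overwrite").parquet(out)
  }
}
```

In this shape the unioned input is scanned twice (once per filter) and the intermediate result makes a full round trip through storage, so the amount of data moved to and from MinIO can be noticeably larger than the input size.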

Skills: VMware, Spark, Data Engineer, Amazon S3, Big Data

About the client:
(5 reviews) SAINT DENIS, France

Project ID: #35893478

5 freelancers are bidding an average of €334 for this job


Hi there, I am excited to share my expertise and skills in data engineering and big data, which I have acquired over the past 3 years. I am confident that I can meet your requirements. I would be delighted to work with...

EUR in 5 days
(1 rating)

Hi there, how are you? I have gone through your project details. I would like to tell you that I have a great deal of experience in VMware, Spark, Data Engineer, Big Data and Amazon S3. For that I would require from...

EUR in 8 days
(0 reviews)

Hi Saint Denis, I am a Data Engineer with 7+ years of experience. I would like to offer my help to fix this issue. Please let me know if we can connect.

EUR in 7 days
(0 reviews)

Hi, I have 10 years of experience in this. I would like to work for you, as I have already done similar tasks and supported many projects and people in the same way. I would like to hear from you. Thank you for...

EUR in 7 days
(1 rating)

Hi, I am a data engineer with 5 years of experience. I have designed and built large-scale Spark pipelines for use cases similar to yours. Unfortunately, as you might be aware, there is no straightforward answer to your pro...

EUR in 15 days
(0 reviews)