Spring Batch - Process Multiple Files Parallel


Today, we will discuss how to process multiple files concurrently using Spring Batch.

Prerequisite: basic knowledge of the Spring and Spring Batch frameworks is required.

Background
Spring Batch is a lightweight, comprehensive batch framework designed to enable the development of robust batch applications vital for the daily operations of enterprise systems. Spring Batch builds upon the characteristics of the Spring Framework that people have come to expect (productivity, POJO-based development approach, and general ease of use), while making it easy for developers to access and leverage more advanced enterprise services when necessary.

Scaling and Parallel Processing
Spring Batch offers multiple options for scaling and parallel processing. At a very high level, they fall into the following categories.

  1. Multi-threaded Step
  2. Parallel Steps
  3. Remote Chunking
  4. Partitioning

For processing multiple files, we will use partitioning.

Partitioning

In partitioning, the Job executes as a sequence of Steps, one of which is designated the master. The slaves are all identical instances of the target Step, any of which could take the place of the master with the same outcome for the Job. The slaves are typically remote services, but can also be local threads of execution. The messages sent by the master to the slaves do not need to be durable or have guaranteed delivery: the Spring Batch meta-data in the JobRepository ensures that each slave is executed once and only once for each Job execution.
If required, we can pass data from the master to the slaves through the step execution context.

Demo Application
To process multiple files concurrently, we will extend the Spring Batch sample application provided in the Getting Started guide here.

Sample application: the sample imports data from a CSV spreadsheet, transforms it with custom code, and stores the final results in a database. We will add the capability to process multiple files concurrently, step by step.

Defining the Partitioner bean using MultiResourcePartitioner
MultiResourcePartitioner is an implementation of Partitioner that locates multiple resources and associates their file names with execution context keys. It creates one ExecutionContext per resource and labels them {partition0, partition1, …, partitionN}.
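To see what that naming scheme produces, here is a minimal plain-Java sketch of the idea, using an ordinary Map to stand in for Spring Batch's ExecutionContext (the class and method names here are illustrative, not part of the framework):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionNamingDemo {

    // Conceptual sketch: one context per resource, keyed "partition0..N",
    // each holding that resource's file name under the "fileName" key.
    static Map<String, Map<String, String>> partition(String[] files) {
        Map<String, Map<String, String>> contexts = new LinkedHashMap<>();
        for (int i = 0; i < files.length; i++) {
            Map<String, String> ctx = new LinkedHashMap<>();
            ctx.put("fileName", files[i]);
            contexts.put("partition" + i, ctx);
        }
        return contexts;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> m =
                partition(new String[] { "file:data-1.csv", "file:data-2.csv" });
        System.out.println(m.keySet());                          // [partition0, partition1]
        System.out.println(m.get("partition1").get("fileName")); // file:data-2.csv
    }
}
```

Each labeled context later becomes the stepExecutionContext of one slave step, which is how each slave knows which file to read.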

MultiResourcePartitioner Bean Configuration
@Bean("partitioner")
@StepScope
public Partitioner partitioner() {
    log.info("In Partitioner");
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();
    try {
        // Each CSV file matched on the classpath root becomes one partition
        partitioner.setResources(resolver.getResources("/*.csv"));
    } catch (IOException e) {
        throw new IllegalStateException("Unable to resolve input files", e);
    }
    return partitioner;
}

Configuration of the Master Step
Master Step & TaskExecutor Configuration
@Bean
@Qualifier("masterStep")
public Step masterStep() {
    return stepBuilderFactory.get("masterStep")
            .partitioner("step1", partitioner())
            .step(step1())
            .taskExecutor(taskExecutor())
            .build();
}

@Bean
public ThreadPoolTaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setMaxPoolSize(10);
    taskExecutor.setCorePoolSize(10);
    taskExecutor.setQueueCapacity(10);
    taskExecutor.afterPropertiesSet();
    return taskExecutor;
}
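The Job definition itself is not shown above; a minimal sketch, assuming a jobBuilderFactory field alongside the masterStep bean (the job name partitionedJob is an illustrative choice, not from the sample), could look like:

```java
@Bean
public Job partitionedJob() {
    return jobBuilderFactory.get("partitionedJob")
            // New run id per launch, so the job can be re-run on the same files
            .incrementer(new RunIdIncrementer())
            .flow(masterStep())
            .end()
            .build();
}
```

The master step fans out into one slave step execution per partition, and the job completes once all partitions have finished.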

Binding Input Data to Steps: Passing the File Name
This is done using the StepScope feature of Spring Batch, which allows late binding: the file name is resolved from the stepExecutionContext at runtime. MultiResourcePartitioner stores each resource's URL in the context under the key fileName, which is why the reader below can open it with a UrlResource.
FlatFileItemReader Configuration
@Bean
@StepScope
@Qualifier("personItemReader")
public FlatFileItemReader<Person> personItemReader(
        @Value("#{stepExecutionContext['fileName']}") String filename)
        throws MalformedURLException {
    log.info("In Reader");
    return new FlatFileItemReaderBuilder<Person>()
            .name("personItemReader")
            .delimited()
            .names(new String[] { "firstName", "lastName" })
            .fieldSetMapper(new BeanWrapperFieldSetMapper<Person>() {
                {
                    setTargetType(Person.class);
                }
            })
            .resource(new UrlResource(filename))
            .build();
}

Configuration of the Slave Step
Slave Step Configuration
@Bean
public Step step1() {
    return stepBuilderFactory.get("step1")
            .<Person, Person>chunk(10)
            .reader(personItemReader)
            .processor(processor())
            .writer(writer)
            .build();
}
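The processor() referenced above is not defined in this post; assuming the same PersonItemProcessor as the Getting Started sample (which upper-cases the names of each record), a sketch of it would be:

```java
public class PersonItemProcessor implements ItemProcessor<Person, Person> {

    @Override
    public Person process(Person person) {
        // Transform each record before writing; the Getting Started
        // sample converts both names to upper case.
        return new Person(person.getFirstName().toUpperCase(),
                          person.getLastName().toUpperCase());
    }
}
```

Because the slave step is partitioned, each thread runs its own chunk-oriented read-process-write loop over exactly one file.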

Now, if we launch the application, we can see in the logs that each file is processed by a separate thread.

References
Spring Batch Getting Started
Scaling and Parallel Processing
Partitioning
GitHub link of the solution
