EzDevInfo.com

spring-batch

Spring Batch is a framework for writing offline and batch applications using Spring and Java Spring Batch

http://projects.spring.io/spring-batch/

Spring-batch @BeforeStep does not work with @StepScope

I'm using Spring Batch version 2.2.4.RELEASE I tried to write a simple example with stateful ItemReader, ItemProcessor and ItemWriter beans.

public class StatefulItemReader implements ItemReader<String> {

    private List<String> list;

    @BeforeStep
    public void initializeState(StepExecution stepExecution) {
        this.list = new ArrayList<>();
    }

    @AfterStep
    public ExitStatus exploitState(StepExecution stepExecution) {
        System.out.println("******************************");
        System.out.println(" READING RESULTS : " + list.size());

        return stepExecution.getExitStatus();
    }

    @Override
    public String read() throws Exception {
        this.list.add("some stateful reading information");
        if (list.size() < 10) {
            return "value " + list.size();
        }
        return null;
    }
}

In my integration test, I'm declaring my beans in an inner static java config class like the one below:

@ContextConfiguration
@RunWith(SpringJUnit4ClassRunner.class)
public class SingletonScopedTest {

    @Configuration
    @EnableBatchProcessing
    static class TestConfig {
        @Autowired
        private JobBuilderFactory jobBuilder;
        @Autowired
        private StepBuilderFactory stepBuilder;

        @Bean
        JobLauncherTestUtils jobLauncherTestUtils() {
            return new JobLauncherTestUtils();
        }

        @Bean
        public DataSource dataSource() {
            EmbeddedDatabaseBuilder embeddedDatabaseBuilder = new EmbeddedDatabaseBuilder();
            return embeddedDatabaseBuilder.addScript("classpath:org/springframework/batch/core/schema-drop-hsqldb.sql")
                    .addScript("classpath:org/springframework/batch/core/schema-hsqldb.sql")
                    .setType(EmbeddedDatabaseType.HSQL)
                    .build();
        }

        @Bean
        public Job jobUnderTest() {
            return jobBuilder.get("job-under-test")
                    .start(stepUnderTest())
                    .build();
        }

        @Bean
        public Step stepUnderTest() {
            return stepBuilder.get("step-under-test")
                    .<String, String>chunk(1)
                    .reader(reader())
                    .processor(processor())
                    .writer(writer())
                    .build();
        }

        @Bean
        public ItemReader<String> reader() {
            return new StatefulItemReader();
        }

        @Bean
        public ItemProcessor<String, String> processor() {
            return new StatefulItemProcessor();
        }

        @Bean
        public ItemWriter<String> writer() {
            return new StatefulItemWriter();
        }
    }

    @Autowired
    JobLauncherTestUtils jobLauncherTestUtils;

    @Test
    public void testStepExecution() {
        JobExecution jobExecution = jobLauncherTestUtils.launchStep("step-under-test");

        assertEquals(ExitStatus.COMPLETED, jobExecution.getExitStatus());
    }
}

This test passes.

But as soon as I define my StatefulItemReader as a step scoped bean (which is better for a stateful reader), the "before step" code is no longer executed.

...
    @Bean
    @StepScope
    public ItemReader<String> reader() {
        return new StatefulItemReader();
    }
...

And I notice the same issue with processor and my writer beans.

What's wrong with my code? Is it related to this resolved issue: https://jira.springsource.org/browse/BATCH-1230

My whole Maven project with several JUnit tests can be found on GitHub: https://github.com/galak75/spring-batch-step-scope

Thank you in advance for your answers.

Source: (StackOverflow)

Microsoft equivalent of Spring Batch

We are designing an architecture that requires a batch framework.

On the Java side we are considering Spring Batch.

We are wondering what are the equivalent tools / frameworks in the Microsoft stack?

Source: (StackOverflow)

Spring Batch: Which ItemReader implementation to use for high volume & low latency

Use case: Read 10 million rows [10 columns] from database and write to a file (csv format).

Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason?
Which would be better performing (fast) in the above use case?
Would the selection be different in case of a single-process vs multi-process approach?
In case of a multi-threaded approach using TaskExecutor, which one would be better & simple?

Source: (StackOverflow)

How can we share data between the different steps of a Job in Spring Batch?

Digging into Spring Batch, I'd like to know as to How can we share data between the different steps of a Job?

Can we use JobRepository for this? If yes, how can we do that?

Is there any other way of doing/achieving the same?

Thanks, Karan

Source: (StackOverflow)

How can I get started with Spring Batch? [closed]

I'm trying to learn Spring Batch, but the startup guide is very confusing. Comments like

You can get a pretty good idea about how to set up a job by examining the unit tests in the org.springframework.batch.sample package (in src/main/java) and the configuration in src/main/resources/jobs.

aren't exactly helpful. Also I find the Sample project very complicated (17 non-empty Namespaces with 109 classes)! Is there a simpler place to get started with Spring Batch?

Source: (StackOverflow)

How to get access to job parameters from ItemReader, in Spring Batch?

This is part of my job.xml:

<job id="foo" job-repository="job-repository">
  <step id="bar">
    <tasklet transaction-manager="transaction-manager">
      <chunk commit-interval="1"
        reader="foo-reader" writer="foo-writer"
      />
    </tasklet>
  </step>
</job>

This is the item reader:

import org.springframework.batch.item.ItemReader;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;
@Component("foo-reader")
public final class MyReader implements ItemReader<MyData> {
  @Override
  public MyData read() throws Exception {
    //...
  }
  @Value("#{jobParameters['fileName']}")
  public void setFileName(final String name) {
    //...
  }
}

This is what Spring Batch is saying in runtime:

Field or property 'jobParameters' cannot be found on object of 
type 'org.springframework.beans.factory.config.BeanExpressionContext'

What's wrong here? Where I can read more about these mechanisms in Spring 3.0?

Source: (StackOverflow)

How Spring Boot run batch jobs

I followed this sample for Spring Batch with Boot: https://github.com/spring-projects/spring-boot/blob/master/spring-boot-samples/spring-boot-sample-batch/src/main/java/sample/batch/SampleBatchApplication.java

When you run the main method the job is executed. This way I can't figure out how one can control the job execution. For example how you schedule a job, or get access to the job execution, or set job parameters.

I tried to register my own JobLauncher

    @Bean
    public JobLauncher jobLauncher(JobRepository jobRepo){
        SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
        simpleJobLauncher.setJobRepository(jobRepo);
        return simpleJobLauncher;
    }

but when i try to use it in the main method:

public static void main(String[] args) {
    ConfigurableApplicationContext ctx = SpringApplication.run(Application.class, args);    
    JobLauncher jobLauncher = ctx.getBean(JobLauncher.class);
    //try catch removed for readability
    jobLauncher.run(ctx.getBean(Job.class), new JobParameters());   
}

The job is again xecuted when the context is loaded and i got JobInstanceAlreadyCompleteException when i try to run it manually. Is there a way to prevet the automatic job execution?

Source: (StackOverflow)

MapReduce/Aggregate operations in SpringBatch

Is it possible to do MapReduce style operations in SpringBatch?

I have two steps in my batch job. The first step calculates average. The second step compares each value with average to determine another value.

For example, Lets say i have a huge database of Student scores. The first step calculates average score in each course/exam. The second step compares individual scores with average to determine grade based on some simple rule:

A if student scores above average
B if student score is Average
C if student scores below average

Currently my first step is a Sql which selects average and writes it to a table. Second step is a Sql which joins average scores with individual scores and uses a Processor to implement the rule.

There are similar aggregation functions like avg, min used a lot in Steps and I'd really prefer if this can be done in Processors keeping the Sqls as simple as possible. Is there any way to write a Processor which aggregates results across multiple rows based on a grouping criteria and then Writes Average/Min to the Output table once?

This pattern repeats a lot and i'm not looking for a Single processor implementation using a Sql which fetches both average and individual scores.

Source: (StackOverflow)

Need understanding of spring.handlers and spring.schemas

I have some questions derived from a problem that I have already solved through this other question. However, I am still wondering about the root cause. My questions are as follows:

What is the purpose of spring.handlers and spring.schemas?

As I understand it's a way of telling the Spring Framework where to locate the xsd so that everything is wired and loaded correctly. But...

Under what circumstances should I have those two files under the META-INF folder?
In my other question linked above, does anybody know why I had to add the maven-shade-plugin to create those two files (based on all my dependencies) under META-INF? In other words, what was the ROOT CAUSE that made me have to use the maven shade plugin?

Source: (StackOverflow)

Batch processing and functional programming

As a Java developer, I'm used to use Spring Batch for batch processing, generally using a streaming library to export large XML files with StAX for exemple.

I'm now developping a Scala application, and wonder if there's any framework, tool or guideline to achieve batch processing.

My Scala application uses the Cake Pattern and I'm not sure how I could integrate this with SpringBatch. Also, I'd like to follow the guidelines described in Functional programming in Scala and try to keep functional purity, using stuff like the IO monad...

I know this is kind of an open question, but I never read anything about this...

Has anyone already achieved functional batch processing here? How was it working? Am I supposed to have a main that creates a batch processing operation in an IO monad and run it? Is there any tool or guideline to help, monitor or handle restartability, like we use Spring Batch in Java. Do you use Spring Batch in Scala? How do you handle the integration part, for exemple waiting for a JMS/AMQP message to start the treatment that produces an XML?

Any feedback on the subjet is welcome

Source: (StackOverflow)

Ways to reduce memory churn

Background

I have a Spring batch program that reads a file (example file I am working with is ~ 4 GB in size), does a small amount of processing on the file, and then writes it off to an Oracle database.

My program uses 1 thread to read the file, and 12 worker threads to do the processing and database pushing.

I am churning lots and lots and lots of young gen memory, which is causing my program to go slower than I think it should.

Setup

JDK 1.6.18
Spring batch 2.1.x
4 Core Machine w 16 GB ram

-Xmx12G 
-Xms12G 
-NewRatio=1 
-XX:+UseParallelGC
-XX:+UseParallelOldGC

Problem

With these JVM params, I get somewhere around 5.x GB of memory for Tenured Generation, and around 5.X GB of memory for Young Generation.

In the course of processing this one file, my Tenured Generation is fine. It grows to a max of maybe 3 GB, and I never need to do a single full GC.

However, the Young Generation hits it's max many times. It goes up to 5 GB range, and then a parallel minor GC happens and clears Young Gen down to 500MB used. Minor GCs are good and better than a full GC, but it still slows down my program a lot (I am pretty sure the app still freezes when a young gen collection occurs, because I see the database activity die off). I am spending well over 5% of my program time frozen for minor GCs, and this seems excessive. I would say over the course of processing this 4 GB file, I churn through 50-60GB of young gen memory.

I don't see any obvious flaws in my program. I am trying to obey the general OO principles and write clean Java code. I am trying not to create objects for no reason. I am using thread pools, and whenever possible passing objects along instead of creating new objects. I am going to start profiling the application, but I was wondering if anyone had some good general rules of thumb or anti patterns to avoid that lead to excessive memory churn? Is 50-60GB of memory churn to process a 4GB file the best I can do? Do I have to revert to JDk 1.2 tricks like Object Pooling? (although Brian Goetz give a presentation that included why object pooling is stupid, and we don't need to do it anymore. I trust him a lot more than I trust myself .. :) )

Source: (StackOverflow)

Synchronizing table data across databases

I have one table that records its row insert/update timestamps on a field.

I want to synchronize data in this table with another table on another db server. Two db servers are not connected and synchronization is one way (master/slave). Using table triggers is not suitable

My workflow:

I use a global last_sync_date parameter and query table Master for the changed/inserted records
Output the resulting rows to xml
Parse the xml and update table Slave using updates and inserts

The complexity of the problem rises when dealing with deleted records of Master table. To catch the deleted records I think I have to maintain a log table for the previously inserted records and use sql "NOT IN". This becomes a performance problem when dealing with large datasets.

What would be an alternative workflow dealing with this scenario?

Source: (StackOverflow)

How does Spring Batch manage transactions (with possibly multiple datasources)?

I would like some information about the data flow in a Spring Batch processing but fail to find what I am looking for on the Internet (despite some useful questions on this site).

I am trying to establish standards to use Spring Batch in our company and we are wondering how Spring Batch behaves when several processors in a step updates data on different data sources.

This question focuses on a chunked process but feel free to provide information on other modes.

From what I have seen (please correct me if I am wrong), when a line is read, it follows the whole flow (reader, processors, writer) before the next is read (as opposed to a silo-processing where reader would process all lines, send them to the processor, and so on).

In my case, several processors read data (in different databases) and updates them in the process, and finally the writer inserts data into yet another DB. For now, the JobRepository is not linked to a database, but that would be an independent one, making the thing still a bit more complex.

This model cannot be changed since the data belongs to several business areas.

How is the transaction managed in this case? Is the data committed only once the full chunk is processed? And then, is there a 2-phase commit management? How is it ensured? What development or configuration should be made in order to ensure the consistency of data?

More generally, what would your recommendations be in a similar case?

Source: (StackOverflow)

Spring: Getting FactoryBean object instead of FactoryBean.getObject()

Short question: If I have class that impelemnts FactoryBean interface, how can I get from FactoryBean object itself instead of FactoryBean.getObject()?

Long question: I have to use 3-rd party Spring based library which is hardly use FactoryBean interface. Right now I always must configure 2 beans:

<!-- Case 1-->
<bean id="XYZ" class="FactoryBean1" scope="prototype">
	<property name="steps">
		<bean class="FactoryBean2">
			<property name="itemReader" ref="aName"/>
		</bean>
	</property>
</bean>

<bean id="aName" class="com.package.ClassName1" scope="prototype">
	<property name="objectContext">
		<bean class="com.package.ABC"/>
	</property>
</bean>

<!-- Case 2-->
<bean id="XYZ2" class="FactoryBean1" scope="prototype">
	<property name="steps">
		<bean class="FactoryBean2">
			<property name="itemReader" ref="aName2"/>
		</bean>
	</property>
</bean>

<bean id="aName2" class="com.package.ClassName1" scope="prototype">
	<property name="objectContext">
		<bean class="com.package.QWE"/>
	</property>
</bean>

Actyually defintion of a bean with name "XYZ" (compare with "XYZ2") never will be changed, but because of factory nature I must copy the code for each configuration. Definition of a bean with name "aName" always will be new (i.e. each configuration will have own objectContext value).

I would like to simplify the configuration have a single factory bean (remove "XYZ2" and rid of link to "aName"):

<bean id="XYZ" class="FactoryBean1" scope="prototype">
	<property name="steps">
		<bean class="FactoryBean2"/>
	</property>
</bean>

<bean id="aName" class="com.package.ClassName1" scope="prototype">
	<property name="objectContext">
		<bean class="com.package.ABC"/>
	</property>
</bean>


<bean id="aName2" class="com.package.ClassName1" scope="prototype">
	<property name="objectContext">
		<bean class="com.package.QWE"/>
	</property>
</bean>

Unfortunately, it's not as simple as I expect. I suppose to glue factory (i.e. XYZ bean from the example) with necessary objects (i.e. "aName", "aName2") at runtime. The approach doesn't work because when I ask Spring for FactoryBean object it returns to me FactoryBean.getObject() which impossible to instanciate at that time because of missing itemReader value.

I hope that SpringSource foresee my case I can somehome "hook" FactoryBean.getObject() call to provide all necessary properties at runtime.

Another complexity that disturb me a bit it's chains of Factories (Factory1 get an object from Factory2 that I have to "hook" at runtime).

Any ideas will be appreciated.

Source: (StackOverflow)

building spring batch sample application

I am trying to build the sample application for spring batch 2.1.6. (ie. spring-batch-2.1.6.RELEASE/samples/spring-batch-samples) using maven but am getting this error for a missing plugin:

[ERROR] Plugin com.springsource.bundlor:com.springsource.bundlor.maven:1.0.0.RELEASE or one of its dependencies could not be resolved: Failure to find com.springsource.bundlor:com.springsource.bundlor.maven:jar:1.0.0.RELEASE in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced ->

Is there another repository I can set up to get this plugin? I am a bit suprised to be getting this errror as this is the latest realease version of spring batch.

Here is the repository section from the pom as it came in the download:

<repositories>
    <repository>
        <id>com.springsource.repository.bundles.external</id>
        <name>SpringSource Enterprise Bundle Repository - SpringSource Bundle External</name>
        <url>http://repository.springsource.com/maven/bundles/external</url>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>

Source: (StackOverflow)