spring-batch
Spring Batch is a framework for writing offline and batch applications using Spring and Java.
Spring Batch
I'm using Spring Batch version 2.2.4.RELEASE
I tried to write a simple example with stateful ItemReader, ItemProcessor and ItemWriter beans.
public class StatefulItemReader implements ItemReader<String> {

    private List<String> list;

    @BeforeStep
    public void initializeState(StepExecution stepExecution) {
        this.list = new ArrayList<>();
    }

    @AfterStep
    public ExitStatus exploitState(StepExecution stepExecution) {
        System.out.println("******************************");
        System.out.println(" READING RESULTS : " + list.size());
        return stepExecution.getExitStatus();
    }

    @Override
    public String read() throws Exception {
        this.list.add("some stateful reading information");
        if (list.size() < 10) {
            return "value " + list.size();
        }
        return null;
    }
}
In my integration test, I'm declaring my beans in an inner static java config class like the one below:
@ContextConfiguration
@RunWith(SpringJUnit4ClassRunner.class)
public class SingletonScopedTest {

    @Configuration
    @EnableBatchProcessing
    static class TestConfig {

        @Autowired
        private JobBuilderFactory jobBuilder;

        @Autowired
        private StepBuilderFactory stepBuilder;

        @Bean
        JobLauncherTestUtils jobLauncherTestUtils() {
            return new JobLauncherTestUtils();
        }

        @Bean
        public DataSource dataSource() {
            EmbeddedDatabaseBuilder embeddedDatabaseBuilder = new EmbeddedDatabaseBuilder();
            return embeddedDatabaseBuilder.addScript("classpath:org/springframework/batch/core/schema-drop-hsqldb.sql")
                    .addScript("classpath:org/springframework/batch/core/schema-hsqldb.sql")
                    .setType(EmbeddedDatabaseType.HSQL)
                    .build();
        }

        @Bean
        public Job jobUnderTest() {
            return jobBuilder.get("job-under-test")
                    .start(stepUnderTest())
                    .build();
        }

        @Bean
        public Step stepUnderTest() {
            return stepBuilder.get("step-under-test")
                    .<String, String>chunk(1)
                    .reader(reader())
                    .processor(processor())
                    .writer(writer())
                    .build();
        }

        @Bean
        public ItemReader<String> reader() {
            return new StatefulItemReader();
        }

        @Bean
        public ItemProcessor<String, String> processor() {
            return new StatefulItemProcessor();
        }

        @Bean
        public ItemWriter<String> writer() {
            return new StatefulItemWriter();
        }
    }

    @Autowired
    JobLauncherTestUtils jobLauncherTestUtils;

    @Test
    public void testStepExecution() {
        JobExecution jobExecution = jobLauncherTestUtils.launchStep("step-under-test");
        assertEquals(ExitStatus.COMPLETED, jobExecution.getExitStatus());
    }
}
This test passes.
But as soon as I define my StatefulItemReader as a step scoped bean (which is better for a stateful reader), the "before step" code is no longer executed.
...
@Bean
@StepScope
public ItemReader<String> reader() {
return new StatefulItemReader();
}
...
I notice the same issue with my processor and writer beans.
What's wrong with my code? Is it related to this resolved issue: https://jira.springsource.org/browse/BATCH-1230
My whole Maven project with several JUnit tests can be found on GitHub: https://github.com/galak75/spring-batch-step-scope
Thank you in advance for your answers.
Source: (StackOverflow)
We are designing an architecture that requires a batch framework.
On the Java side we are considering Spring Batch.
What are the equivalent tools/frameworks in the Microsoft stack?
Source: (StackOverflow)
Use case: Read 10 million rows [10 columns] from database and write to a file (csv format).
Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason?
Which would be better performing (fast) in the above use case?
Would the selection be different in case of a single-process vs multi-process approach?
In the case of a multi-threaded approach using a TaskExecutor, which one would be better and simpler?
Source: (StackOverflow)
Digging into Spring Batch, I'd like to know how we can share data between the different steps of a Job.
Can we use JobRepository for this? If yes, how can we do that?
Is there any other way of doing/achieving the same?
Thanks,
Karan
Source: (StackOverflow)
I'm trying to learn Spring Batch, but the startup guide is very confusing. Comments like
"You can get a pretty good idea about how to set up a job by examining the unit tests in the org.springframework.batch.sample package (in src/main/java) and the configuration in src/main/resources/jobs."
aren't exactly helpful.
Also, I find the sample project very complicated (17 non-empty namespaces with 109 classes)! Is there a simpler place to get started with Spring Batch?
Source: (StackOverflow)
This is part of my job.xml:
<job id="foo" job-repository="job-repository">
    <step id="bar">
        <tasklet transaction-manager="transaction-manager">
            <chunk commit-interval="1"
                   reader="foo-reader" writer="foo-writer"
            />
        </tasklet>
    </step>
</job>
This is the item reader:
import org.springframework.batch.item.ItemReader;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component("foo-reader")
public final class MyReader implements ItemReader<MyData> {

    @Override
    public MyData read() throws Exception {
        //...
    }

    @Value("#{jobParameters['fileName']}")
    public void setFileName(final String name) {
        //...
    }
}
This is what Spring Batch says at runtime:
Field or property 'jobParameters' cannot be found on object of
type 'org.springframework.beans.factory.config.BeanExpressionContext'
What's wrong here? Where can I read more about these mechanisms in Spring 3.0?
Source: (StackOverflow)
I followed this sample for Spring Batch with Boot: https://github.com/spring-projects/spring-boot/blob/master/spring-boot-samples/spring-boot-sample-batch/src/main/java/sample/batch/SampleBatchApplication.java
When you run the main method the job is executed.
This way I can't figure out how one can control the job execution. For example how you schedule a job, or get access to the job execution, or set job parameters.
I tried to register my own JobLauncher:
@Bean
public JobLauncher jobLauncher(JobRepository jobRepo) {
    SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
    simpleJobLauncher.setJobRepository(jobRepo);
    return simpleJobLauncher;
}
But when I try to use it in the main method:
public static void main(String[] args) {
    ConfigurableApplicationContext ctx = SpringApplication.run(Application.class, args);
    JobLauncher jobLauncher = ctx.getBean(JobLauncher.class);
    //try catch removed for readability
    jobLauncher.run(ctx.getBean(Job.class), new JobParameters());
}
The job is executed again when the context is loaded, and I get a JobInstanceAlreadyCompleteException when I try to run it manually.
Is there a way to prevent the automatic job execution?
Source: (StackOverflow)
Is it possible to do MapReduce-style operations in Spring Batch?
I have two steps in my batch job. The first step calculates average. The second step compares each value with average to determine another value.
For example, let's say I have a huge database of student scores. The first step calculates the average score in each course/exam. The second step compares individual scores with the average to determine a grade based on a simple rule:
- A if the student scores above average
- B if the student's score equals the average
- C if the student scores below average
Currently my first step is a SQL query that selects the average and writes it to a table. The second step is a SQL query that joins average scores with individual scores and uses a Processor to implement the rule.
Aggregation functions like avg and min are used a lot in my steps, and I'd really prefer to do this in Processors, keeping the SQL as simple as possible. Is there any way to write a Processor that aggregates results across multiple rows based on a grouping criterion and then writes the average/min to the output table once?
This pattern repeats a lot, and I'm not looking for a single-processor implementation using a SQL query that fetches both average and individual scores.
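To make the two-pass idea above concrete, here is a minimal plain-Java sketch of "aggregate per group, then grade each row". It is not Spring Batch code: the class `GradeAggregation`, the `Score` type, and both method names are hypothetical and exist only for illustration. In a real job, the accumulation would presumably live in some stateful component of the first step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from Spring Batch.
class GradeAggregation {

    // One input row: a student's score in one course.
    static final class Score {
        final String student;
        final String course;
        final double value;

        Score(String student, String course, double value) {
            this.student = student;
            this.course = course;
            this.value = value;
        }
    }

    // First pass: accumulate sum and count per course, then divide once per group.
    static Map<String, Double> averageByCourse(List<Score> scores) {
        Map<String, double[]> sumAndCount = new LinkedHashMap<>();
        for (Score s : scores) {
            double[] acc = sumAndCount.computeIfAbsent(s.course, k -> new double[2]);
            acc[0] += s.value; // running sum
            acc[1] += 1;       // running count
        }
        Map<String, Double> averages = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    // Second pass: the A/B/C rule from the question, applied to one score.
    static String grade(double score, double average) {
        if (score > average) {
            return "A";
        }
        if (score < average) {
            return "C";
        }
        return "B";
    }

    public static void main(String[] args) {
        List<Score> scores = new ArrayList<>();
        scores.add(new Score("ann", "math", 90));
        scores.add(new Score("bob", "math", 70));
        scores.add(new Score("cid", "math", 80));

        Map<String, Double> averages = averageByCourse(scores);
        for (Score s : scores) {
            System.out.println(s.student + " " + grade(s.value, averages.get(s.course)));
        }
    }
}
```

The sketch keeps only one small accumulator per group in memory, so the first pass can stream over the rows; the averages are emitted once, after the last row has been seen.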
Source: (StackOverflow)
I have some questions derived from a problem that I have already solved through this other question. However, I am still wondering about the root cause. My questions are as follows:
- What is the purpose of spring.handlers and spring.schemas?
As I understand it, they are a way of telling the Spring Framework where to locate the XSD so that everything is wired and loaded correctly. But...
Under what circumstances should I have those two files under the META-INF folder?
In my other question linked above, does anybody know why I had to add the maven-shade-plugin to create those two files (based on all my dependencies) under META-INF? In other words, what was the ROOT CAUSE that made me have to use the maven-shade-plugin?
Source: (StackOverflow)
As a Java developer, I'm used to using Spring Batch for batch processing, generally with a streaming library such as StAX to export large XML files.
I'm now developing a Scala application, and I wonder whether there is any framework, tool or guideline for batch processing.
My Scala application uses the Cake Pattern and I'm not sure how I could integrate this with Spring Batch. Also, I'd like to follow the guidelines described in Functional Programming in Scala and try to keep functional purity, using things like the IO monad...
I know this is kind of an open question, but I have never read anything about this...
Has anyone already achieved functional batch processing here? How did it work? Am I supposed to have a main that creates a batch processing operation in an IO monad and runs it? Is there any tool or guideline to help monitor or handle restartability, the way Spring Batch does in Java?
Do you use Spring Batch in Scala?
How do you handle the integration part, for example waiting for a JMS/AMQP message to start the processing that produces an XML file?
Any feedback on the subject is welcome.
Source: (StackOverflow)
Background
I have a Spring Batch program that reads a file (the example file I am working with is ~4 GB in size), does a small amount of processing on the file, and then writes it to an Oracle database.
My program uses 1 thread to read the file, and 12 worker threads to do the processing and database pushing.
I am churning lots and lots and lots of young gen memory, which is causing my program to go slower than I think it should.
Setup
JDK 1.6.18
Spring batch 2.1.x
4 Core Machine w 16 GB ram
-Xmx12G
-Xms12G
-XX:NewRatio=1
-XX:+UseParallelGC
-XX:+UseParallelOldGC
Problem
With these JVM params, I get somewhere around 5.x GB of memory for Tenured Generation, and around 5.X GB of memory for Young Generation.
In the course of processing this one file, my Tenured Generation is fine. It grows to a max of maybe 3 GB, and I never need to do a single full GC.
However, the Young Generation hits its max many times. It goes up to the 5 GB range, and then a parallel minor GC happens and clears Young Gen down to 500 MB used. Minor GCs are good and better than a full GC, but they still slow down my program a lot (I am pretty sure the app still freezes when a young gen collection occurs, because I see the database activity die off). I am spending well over 5% of my program time frozen for minor GCs, and this seems excessive. I would say that over the course of processing this 4 GB file, I churn through 50-60 GB of young gen memory.
I don't see any obvious flaws in my program. I am trying to obey the general OO principles and write clean Java code. I am trying not to create objects for no reason. I am using thread pools and, whenever possible, passing objects along instead of creating new objects. I am going to start profiling the application, but I was wondering if anyone had some good general rules of thumb, or anti-patterns to avoid, that lead to excessive memory churn? Is 50-60 GB of memory churn to process a 4 GB file the best I can do? Do I have to revert to JDK 1.2 tricks like object pooling? (Although Brian Goetz gave a presentation that included why object pooling is stupid and why we don't need to do it anymore. I trust him a lot more than I trust myself .. :) )
Source: (StackOverflow)
I have one table that records its row insert/update timestamps in a field.
I want to synchronize data in this table with another table on another db server. The two db servers are not connected and synchronization is one way (master/slave). Using table triggers is not suitable.
My workflow:
- I use a global last_sync_date parameter and query table Master for the changed/inserted records
- Output the resulting rows to xml
- Parse the xml and update table Slave using updates and inserts
The complexity of the problem rises when dealing with deleted records of Master table. To catch the deleted records I think I have to maintain a log table for the previously inserted records and use sql "NOT IN". This becomes a performance problem when dealing with large datasets.
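One possible alternative to SQL "NOT IN", sketched here in plain Java under stated assumptions: both tables share a primary key, and the full list of Master keys can be exported alongside the XML and loaded on the Slave side. The class `DeletedKeyDetection` and the method `deletedKeys` are hypothetical names. The idea is a hash-based anti-join: any key that exists on Slave but is absent from Master's key list was deleted on Master.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: detects rows deleted on Master by comparing key sets.
class DeletedKeyDetection {

    // Returns the keys that are present on Slave but no longer exist on Master.
    static Set<Long> deletedKeys(Iterable<Long> masterKeys, Iterable<Long> slaveKeys) {
        // Load Master's keys into a hash set for constant-time lookups.
        Set<Long> master = new HashSet<>();
        for (Long key : masterKeys) {
            master.add(key);
        }
        // Stream over Slave's keys and keep those Master no longer has.
        Set<Long> deleted = new TreeSet<>();
        for (Long key : slaveKeys) {
            if (!master.contains(key)) {
                deleted.add(key);
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        // Example data only: Master lost rows 3 and 5 since the last sync.
        Set<Long> deleted = deletedKeys(
                Arrays.asList(1L, 2L, 4L),
                Arrays.asList(1L, 2L, 3L, 4L, 5L));
        System.out.println(deleted);
    }
}
```

This needs only the key column, not the full rows, and does a single pass over each side. Whether it beats a log table depends on how large the key set is relative to available memory.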
What would be an alternative workflow dealing with this scenario?
Source: (StackOverflow)
I would like some information about the data flow in Spring Batch processing but cannot find what I am looking for on the Internet (despite some useful questions on this site).
I am trying to establish standards for using Spring Batch in our company, and we are wondering how Spring Batch behaves when several processors in a step update data on different data sources.
This question focuses on a chunked process but feel free to provide information on other modes.
From what I have seen (please correct me if I am wrong), when a line is read, it follows the whole flow (reader, processors, writer) before the next is read (as opposed to a silo-processing where reader would process all lines, send them to the processor, and so on).
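For reference, here is a plain-Java model of the chunk-oriented loop as I understand the Spring Batch reference documentation to describe it: items are read one at a time until the commit interval is reached, each item is then processed, and the whole chunk is written at once before the transaction commits. This is only a simulation. `ChunkFlowSketch` and `run` are hypothetical names, and the "commit" entry merely marks where the step's transaction would end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java model of a chunk-oriented step; this is NOT Spring Batch code.
class ChunkFlowSketch {

    // Runs the read/process/write loop and returns a log of the events, in order.
    static List<String> run(List<String> input, int commitInterval) {
        List<String> log = new ArrayList<>();
        Iterator<String> reader = input.iterator();
        boolean exhausted = false;

        while (!exhausted) {
            // 1. Read up to commitInterval items, one at a time.
            List<String> items = new ArrayList<>();
            while (items.size() < commitInterval) {
                if (!reader.hasNext()) {
                    exhausted = true;
                    break;
                }
                String item = reader.next();
                log.add("read " + item);
                items.add(item);
            }
            if (items.isEmpty()) {
                break;
            }

            // 2. Process each item of the chunk.
            List<String> processed = new ArrayList<>();
            for (String item : items) {
                log.add("process " + item);
                processed.add(item.toUpperCase());
            }

            // 3. Write the whole chunk at once, then end the transaction.
            log.add("write " + processed);
            log.add("commit");
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : run(Arrays.asList("a", "b", "c"), 2)) {
            System.out.println(line);
        }
    }
}
```

With a commit interval of 1, each item does go through the whole flow before the next one is read, which matches the observation above. With a larger interval, reads for the chunk come first, then processing, then a single write.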
In my case, several processors read data (from different databases) and update it in the process, and finally the writer inserts data into yet another DB. For now, the JobRepository is not linked to a database, but it would be an independent one, which makes things a bit more complex still.
This model cannot be changed since the data belongs to several business areas.
How is the transaction managed in this case? Is the data committed only once the full chunk is processed? And then, is there a 2-phase commit management? How is it ensured? What development or configuration should be made in order to ensure the consistency of data?
More generally, what would your recommendations be in a similar case?
Source: (StackOverflow)
Short question: If I have a class that implements the FactoryBean interface, how can I get the FactoryBean object itself from Spring instead of the result of FactoryBean.getObject()?
Long question: I have to use a third-party Spring-based library that relies heavily on the FactoryBean interface. Right now I must always configure two beans:
<!-- Case 1-->
<bean id="XYZ" class="FactoryBean1" scope="prototype">
    <property name="steps">
        <bean class="FactoryBean2">
            <property name="itemReader" ref="aName"/>
        </bean>
    </property>
</bean>
<bean id="aName" class="com.package.ClassName1" scope="prototype">
    <property name="objectContext">
        <bean class="com.package.ABC"/>
    </property>
</bean>
<!-- Case 2-->
<bean id="XYZ2" class="FactoryBean1" scope="prototype">
    <property name="steps">
        <bean class="FactoryBean2">
            <property name="itemReader" ref="aName2"/>
        </bean>
    </property>
</bean>
<bean id="aName2" class="com.package.ClassName1" scope="prototype">
    <property name="objectContext">
        <bean class="com.package.QWE"/>
    </property>
</bean>
Actually, the definition of the bean named "XYZ" (compare with "XYZ2") never changes, but because of its factory nature I must copy the code for each configuration.
The definition of the bean named "aName" is always new (i.e. each configuration has its own objectContext value).
I would like to simplify the configuration to have a single factory bean (remove "XYZ2" and get rid of the link to "aName"):
<bean id="XYZ" class="FactoryBean1" scope="prototype">
    <property name="steps">
        <bean class="FactoryBean2"/>
    </property>
</bean>
<bean id="aName" class="com.package.ClassName1" scope="prototype">
    <property name="objectContext">
        <bean class="com.package.ABC"/>
    </property>
</bean>
<bean id="aName2" class="com.package.ClassName1" scope="prototype">
    <property name="objectContext">
        <bean class="com.package.QWE"/>
    </property>
</bean>
Unfortunately, it's not as simple as I expected. I intend to glue the factory (i.e. the XYZ bean from the example) to the necessary objects (i.e. "aName", "aName2") at runtime.
The approach doesn't work because when I ask Spring for the FactoryBean object, it returns FactoryBean.getObject(), which is impossible to instantiate at that time because the itemReader value is missing.
I hope that SpringSource foresaw my case and that I can somehow "hook" the FactoryBean.getObject() call to provide all the necessary properties at runtime.
Another complexity that bothers me a bit is chains of factories (Factory1 gets an object from Factory2, which I have to "hook" at runtime).
Any ideas will be appreciated.
Source: (StackOverflow)
I am trying to build the sample application for Spring Batch 2.1.6 (i.e. spring-batch-2.1.6.RELEASE/samples/spring-batch-samples) using Maven, but I am getting this error for a missing plugin:
[ERROR] Plugin
com.springsource.bundlor:com.springsource.bundlor.maven:1.0.0.RELEASE
or one of its dependencies could not be resolved: Failure to find
com.springsource.bundlor:com.springsource.bundlor.maven:jar:1.0.0.RELEASE
in http://repo1.maven.org/maven2 was cached in the local repository,
resolution will not be reattempted until the update interval of
central has elapsed or updates are forced ->
Is there another repository I can set up to get this plugin? I am a bit surprised to be getting this error, as this is the latest release version of Spring Batch.
Here is the repository section from the pom as it came in the download:
<repositories>
    <repository>
        <id>com.springsource.repository.bundles.external</id>
        <name>SpringSource Enterprise Bundle Repository - SpringSource Bundle External</name>
        <url>http://repository.springsource.com/maven/bundles/external</url>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>
Source: (StackOverflow)