Genome
A simple, type safe, failure driven mapping library for serializing JSON to models in Swift 2.0 (Supports Linux)
I want to download gene expression data generated by microarray experiments. I do not know much about this subject, but as I understand it, rows usually correspond to genes and columns to samples. Ideally, I expect a matrix of gene expression values.
I've been searching the internet, and although there seem to be many places to download such data, when I actually download it I do not get a gene expression matrix. Could someone please tell me where or how to download gene expression data in the format I describe above?
Any help is appreciated.
Source: (StackOverflow)
I have an original set of genomic coordinates (chrom, start, end) in a tab delimited bed file. I also have additional tab delimited bed files that contain some of the original genomic coordinates plus a numerical value associated with each of these coordinates. These coordinates can show up multiple times in a bed file with a different numerical value each time. I need a final bed file that contains each of the original genomic coordinates with the summed number of all the values found to be associated with that specific coordinate. Examples of files I'm working with are below.
Original File:
chr1 2100 2300
chr2 3300 3600
chr1 2560 2800
Other Bed file:
chr1 2100 2300 6
chr2 3300 3600 56
chr1 2100 2300 10
Needed Output file:
chr1 2100 2300 16
chr2 3300 3600 56
chr1 2560 2800 0
I need to write a python script to do this, but I'm not really sure what the best way to do it is.
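One possible approach, sketched in Python 3. It assumes every file is whitespace-delimited, with the coordinates in the first three columns and the value in the fourth. The function names sum_bed_values and format_bed are illustrative, not taken from any library.

```python
def sum_bed_values(original_lines, other_lines):
    """Sum the 4th-column values of other_lines for each coordinate in original_lines."""
    # Plain dicts keep insertion order (Python 3.7+), so the output
    # follows the order of the original file.
    totals = {}
    for line in original_lines:
        fields = line.split()
        if len(fields) >= 3:
            totals[(fields[0], fields[1], fields[2])] = 0.0
    for line in other_lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        key = (fields[0], fields[1], fields[2])
        if key in totals:  # coordinates absent from the original file are ignored
            totals[key] += float(fields[3])
    return totals


def format_bed(totals):
    """Render the totals as tab-delimited BED lines."""
    lines = []
    for key, value in totals.items():
        # Print whole numbers without a trailing ".0".
        text = str(int(value)) if value.is_integer() else str(value)
        lines.append("\t".join(key) + "\t" + text)
    return lines
```

Both arguments can be open file handles. To combine several "other" bed files, their handles could be passed together through itertools.chain.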
Source: (StackOverflow)
I need to replace '|' with a tab so that I can analyze my human annotation genomic data (200+ MB). I'm a research assistant learning how to analyze and manipulate sequencing data in the simplest way possible, so that I can repeat this on more data.
Here is what my data looks like. There are ~400,000 lines of this type of data in one file.
ANN=C|downstream_gene_variant|MODIFIER|OR4G4P|ENSG00000268020|transcript|ENST00000606857|unprocessed_pseudogene||n.*1414T>C|||||1414|,C|intron_variant|MODIFIER|OR4G4P|ENSG00000268020|transcript|ENST00000594647|unprocessed_pseudogene|1/1|n.20-104T>C||||||;DP=11;SS=1;VT=SNP
I tried to use this code to replace '|' with '\t' for several lines.
import csv
infile = 'Book2.xlsx'
with open(infile , 'r') as inf:
    for line in inf:
        w =csv.writer(inf, delimiter = '\t')
        print w
All I'm getting is this :
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
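The output above is the printed representation of the csv.writer object itself, created once per line; nothing is ever written with it. Note also that an .xlsx workbook is a zipped binary format, so it cannot be read line by line as text. A minimal Python 3 sketch, assuming the data has first been exported to a plain-text file (the file names in the comment are hypothetical):

```python
def pipes_to_tabs(line):
    """Replace every '|' in one line with a tab character."""
    return line.replace("|", "\t")


def convert(src, dst):
    """Stream lines from src to dst, so a 200+ MB file never sits in memory."""
    for line in src:
        dst.write(pipes_to_tabs(line))


# Hypothetical usage, with placeholder file names:
# with open("annotations.txt") as src, open("annotations.tsv", "w") as dst:
#     convert(src, dst)
```

Empty fields such as the '||' runs in the sample simply become consecutive tabs, so column positions are preserved.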
Source: (StackOverflow)
I've searched the web for this without much luck. More or less you always get to the example from the VariantAnnotation Package. And since this example works fine on my computer I have no idea why the VCF I created does not.
The problem: I want to determine the number and location of SNPs in selected genes. I have a large VCF file (over 5GB) that has info on all SNPs on all chromosomes for several mouse strains. Obviously my computer freezes if I try to do anything on the whole-genome scale, so I first determined the genomic locations of my genes of interest on chromosome 1. I then used the VariantAnnotation package to extract only the data relating to those genes from the VCF file:
library(VariantAnnotation)
param<-ScanVcfParam(
  info=c("AC1","AF1","DP","DP4","INDEL","MDV","MQ","MSD","PV0","PV1","PV2","PV3","PV4","QD"),
  geno=c("DP","GL","GQ","GT","PL","SP","FI"),
  samples=strain,
  fixed="FILTER",
  which=gnrng
)
The code above is taken out of a function I wrote which takes strain as an argument. gnrng refers to a GRanges object containing genomic locations of my genes of interest.
vcf<-readVcf(file, "mm10",param)
This works fine and I get my vcf (dim: 21783 1), but when I try to save it, it fails:
file.vcf<-tempfile()
writeVcf(vcf, file.vcf)
Error in .pasteCollapse(ALT, ",") : 'x' must be a CharacterList
I even tried in parallel, doing the example from the package first and then substituting for my VCF file:
#This is the example:
out1.vcf<-tempfile()
in1<-readVcf(fl,"hg19")
writeVcf(in1,out1.vcf)
This works just fine, but if I substitute my vcf for in1 I get the same error.
I hope I made myself clear... And any help will be greatly appreciated!! Thanks in advance!
Source: (StackOverflow)
I need to compare the two coordinates (the 2nd and 3rd words in a line) between two files to see where they overlap. My code does this, but very slowly: for a file with 10000 lines it takes about two minutes. I need to use it on a file with 3 billion lines, which I estimate will take forever. Is there a way to refactor my code to make it much faster?
So far I can do exactly what I want, which is this:
import os.path
with open("Output.txt", "w") as result:
    with open("bedgraph2.txt") as file1:
        for f1_line in file1:
            segment_1 = f1_line.split()
            with open("bedgraph1.txt") as file2:
                for f2_line in file2:
                    segment_2 = f2_line.split()
                    if (int(segment_1[2]) > int(segment_2[1])) & (int(segment_1[1]) < int(segment_2[2])):
                        with open("Output.txt", "a") as add:
                            add.write(segment_1[0])
                            add.write(" ")
                            add.write(segment_1[1])
                            add.write(" ")
                            add.write(segment_1[2])
                            add.write(" ")
                            add.write(segment_1[3])
                            add.write(" | ")
                            add.write(segment_2[0])
                            add.write(" ")
                            add.write(segment_2[1])
                            add.write(" ")
                            add.write(segment_2[2])
                            add.write(" ")
                            add.write(segment_2[3])
                            add.write("\n")
                        break
print "done"
This is a sample of the data
bedgraph2.txt
chr01 1780 1795 -0.811494
chr01 1795 1809 -1.622988
chr01 1809 1829 -2.434482
chr01 1829 1830 -3.245976
chr01 1830 1845 -2.434482
chr01 1845 1859 -1.622988
chr01 1859 1879 -0.811494
chr01 1934 1984 -0.811494
chr01 3550 3600 -0.811494
chr01 3790 3840 -0.811494
chr01 3882 3902 -0.811494
chr01 3902 3932 -1.622988
bedgraph1.txt
chr01 1809 1859 -1.139687
chr01 1965 2015 -1.139687
chr01 3790 3840 -1.139687
chr01 3930 3942 -1.139687
chr01 3942 3980 -2.279375
chr01 3980 3992 -1.139687
chr01 4260 4310 -1.139687
chr01 4361 4382 -1.139687
chr01 4382 4411 -2.279375
chr01 4411 4432 -1.139687
chr01 4473 4523 -1.139687
chr01 4605 4655 -1.139687
Thanks in advance
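Much of the cost in the code above comes from reopening and rescanning bedgraph1.txt for every line of bedgraph2.txt. One way to avoid that, sketched in Python 3, is to load one file once and binary-search it. This sketch assumes the intervals within a file do not overlap each other (as in the bedGraph sample), that the loaded file fits in memory, and, unlike the original code, that only intervals on the same chromosome should match. The function names are illustrative.

```python
import bisect
from collections import defaultdict


def load_intervals(lines):
    """Group (start, end, fields) tuples by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) >= 4:
            by_chrom[fields[0]].append((int(fields[1]), int(fields[2]), fields))
    for intervals in by_chrom.values():
        intervals.sort(key=lambda iv: (iv[0], iv[1]))
    return by_chrom


def first_overlaps(query_lines, target_lines):
    """Yield 'query | target' for the first target overlapping each query line."""
    targets = load_intervals(target_lines)
    # With non-overlapping sorted intervals, the end coordinates are sorted too.
    ends = {chrom: [iv[1] for iv in ivs] for chrom, ivs in targets.items()}
    for line in query_lines:
        q = line.split()
        if len(q) < 4 or q[0] not in targets:
            continue
        q_start, q_end = int(q[1]), int(q[2])
        intervals = targets[q[0]]
        # Index of the first target whose end is greater than the query start.
        i = bisect.bisect_right(ends[q[0]], q_start)
        if i < len(intervals) and intervals[i][0] < q_end:
            yield " ".join(q[:4]) + " | " + " ".join(intervals[i][2][:4])
```

Each lookup is a binary search instead of a full scan of the second file. For inputs in the billions of lines, the loaded file may not fit in memory, in which case a streaming merge over two sorted files or a dedicated interval tool would be the next thing to try.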
Source: (StackOverflow)
This dataset contains genome map positions (chr and start) with the summed sequencing coverage (depth) at each position across 20 individuals (dat).
Example:
gbsgre <- "chr start end depth
chr1 3273 3273 7
chr1 3274 3274 3
chr1 3275 3275 8
chr1 3276 3276 4
chr1 3277 3277 25"
gbsgre <- read.table(text=gbsgre, header=T)
This dataset contains genome map positions (V1 plus V2) with the individual coverage (V3) at each position.
Example:
df1 <- "chr start depth
chr1 3273 4
chr1 3276 4
chr1 3277 15"
df1 <- read.table(text=df1, header=T)
df2 <- "chr start depth
chr1 3273 3
chr1 3274 3
chr1 3275 8
chr1 3277 10"
df2 <- read.table(text=df2, header=T)
dat <- NULL
dat[[1]] <- df1
dat[[2]] <- df2
> dat
[[1]]
chr start depth
1 chr1 3273 4
2 chr1 3276 4
3 chr1 3277 15
[[2]]
chr start depth
1 chr1 3273 3
2 chr1 3274 3
3 chr1 3275 8
4 chr1 3277 10
According to the chr and start positions in gbsgre, I need to join the depths (V3) of each of the 20 animals ([[1]] to [[20]]) onto the main table (gbsgre) to generate a final table as follows:
The first column will be the chromosome (V1), the second column (V2) the start position, the third the depth (V3) from the “gbsgre” dataset, the fourth (V4) the depth (dat/V3) of [[1]] from “dat”, and so on, until the twenty-fourth column, which will be the depth of [[20]] in the “dat” dataset.
One very important point: missing data for any of the 20 individuals should be treated as zero (“0”).
And the number of rows in the final table should be the same as in “gbsgre”.
#Example Result
> GBSMeDIP
chr start depth depth1 depth2
1: chr1 3273 7 4 3
2: chr1 3274 3 0 3
3: chr1 3275 8 0 8
4: chr1 3276 4 4 0
5: chr1 3277 25 15 10
Source: (StackOverflow)
I want to fetch a large number of human genome fragments (more than 500 million of them) at random positions.
This is one part of a larger process. I have a .sam result file from bowtie with alignments for 10 million human genome reads. I want to compare each query read with the reference sequence it aligned to, according to the sam file. The reference I used is hg19.fa from UCSC, so I need to be able to get the sequence from hg19.fa (or the per-chromosome files) using the location in the sam file.
E.g. given chr4:35654-35695, I would get the 42bp sequence:
gtcttccagggtttttatatttttgggttttacacttaagt
So far I have 2 solutions:
1. A Python script that fetches sequences from the UCSC DAS server:
http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr4:35654,35695
2. A Python script that calls the ''samtools faidx'' command and returns the command output,
from this post:
http://seqanswers.com/forums/showthread.php?t=23606&highlight=fetch+genome+coordinate
But both are slow. samtools faidx is a bit faster than fetching from the DAS server, but still slow.
So, is there any FAST way to do this? I have the separate chromosome fasta files and the hg19.fa file.
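Both approaches above pay a per-query cost (a network request or a new process). A sketch of an alternative in Python 3 is to read a chromosome file into memory once and then slice strings. It assumes coordinates are 1-based and inclusive, which is consistent with chr4:35654-35695 spanning 42 bases, and that the loaded sequence fits in RAM. The function names are illustrative.

```python
def load_fasta(lines):
    """Return {sequence name: sequence string} from the lines of a FASTA file."""
    sequences = {}
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            header = line[1:].split()
            # The name is the first word of the header line.
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def fetch(sequences, region):
    """Fetch a 1-based, inclusive region written as 'chr4:35654-35695'."""
    chrom, span = region.split(":")
    start, end = (int(x) for x in span.split("-"))
    return sequences[chrom][start - 1:end]
```

After the one-time load, each fetch is a dictionary lookup plus a string slice. Loading all of hg19 this way needs several gigabytes of memory, so working one chromosome file at a time, with the queries grouped by chromosome, may be more practical.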
Source: (StackOverflow)
In AI, are there any simple and/or very visual examples of how one could implement a genome into a simulation?
Basically, I'm after a simple walkthrough (not a tutorial, but rather something of a summarizing nature) which details how to implement a genome that changes the characteristics of an 'individual' in a simulation.
These genes would not be things like:
- Mass
- Strength
- Length,
- Etc..
But rather they should be the things defining the above things, abstracting the genome from the actual characteristics of the inhabitants of the simulation.
Am I clear enough on what I want?
Anyway, if there's any way that you have tried that's better, and that implements evolution in a form like these sexual swimmers, then by all means, go ahead and post it! The more fun inspiration the better :)
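One common way to separate genes from traits, sketched here in Python 3 as an illustration only: the genome is a plain list of numbers, and each visible trait is computed from several genes through a fixed weight table, so no gene "is" mass or strength by itself. The weight values, trait names, and function names are all invented for this example.

```python
import random

# Hypothetical genotype-to-phenotype map: each trait is a weighted mix of 4 genes.
TRAIT_WEIGHTS = {
    "mass":     [0.5, 0.3, 0.0, 0.2],
    "strength": [0.1, 0.6, 0.3, 0.0],
    "length":   [0.0, 0.2, 0.4, 0.4],
}


def express(genome):
    """Compute the visible traits (phenotype) from the genome (genotype)."""
    return {
        trait: sum(w * g for w, g in zip(weights, genome))
        for trait, weights in TRAIT_WEIGHTS.items()
    }


def crossover(parent_a, parent_b, rng):
    """Build a child genome by picking each gene from one parent at random."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]


def mutate(genome, rng, rate=0.05, scale=0.1):
    """Nudge each gene by Gaussian noise with probability `rate`."""
    return [g + rng.gauss(0, scale) if rng.random() < rate else g for g in genome]
```

Because one gene feeds several traits, a single mutation can shift mass and strength together, which is the kind of indirection the question asks about. A simulation loop would call express on each individual, select parents by some fitness measure, and produce children with crossover followed by mutate.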
Source: (StackOverflow)
I’m trying to use a Python Regular Expression to extract a genome sequence from a genome database; I’ve pasted a snippet of the database below.
>GSVIVT01031739001 pacid=17837850 polypeptide=GSVIVT01031739001 locus=GSVIVG01031739001 ID=GSVIVT01031739001.Genoscope12X annot-version=Genoscope.12X ATGAAAACGGAACTCTTTCTAGGTCATTTCCTCTTCAAACAAGAAAGAAGTAAAAGTTGCATACCAAATATGGACTCGAT TTGGAGTCGTAGTGCCCTGTCCACAGCTTCGGACTTCCTCACTGCAATCTACTTCGCCTTCATCTTCATCGTCGCCAGGT TTTTCTTGGACAGATTCATCTATCGAAGGTTGGCCATCTGGTTATTGAGCAAGGGAGCTGTTCCATTGAAGAAAAATGAT GCTACACTGGGAAAAATTGTAAAATGTTCGGAGTCTTTGTGGAAACTAACATACTATGCAACTGTTGAAGCATTCATTCT TGCTATTTCCTACCAAGAGCCATGGTTTAGAGATTCAAAGCAGTACTTTAGAGGGTGGCCAAATCAAGAGTTGACGCTTC CCCTCAAGCTTTTCTACATGTGCCAATGTGGGTTCTACATCTACAGCATTGCTGCCCTTCTTACATGGGAAACTCGCAGG AGGGATTTCTCTGTGATGATGTCTCATCATGTAGTCACTGTTATCCTAATTGGGTACTCATACATATCAAGTTTTGTCCG GATCGGCTCAGTTGTCCTTGCCCTGCACGATGCAAGTGATGTCTTCATGGAAGCTGCAAAAGTTTTTAAATATTCTGAGA AGGAGCTTGCAGCAAGTGTGTGCTTTGGATTTTTTGCCATCTCATGGCTTGTCCTACGGTTAATATTCTTTCCCTTTTGG GTTATCAGTGCATCAAGCTATGATATGCAAAATTGCATGAATCTATCGGAGGCCTATCCCATGTTGCTATACTATGTTTT CAATACAATGCTCTTGACACTACTTGTGTTCCATATATACTGGTGGATTCTTATATGCTCAATGATTATGAGACAGCTGA AAAATAGAGGACAAGTTGGAGAAGATATAAGATCTGATTCAGAGGACGATGAATAG
>GSVIVT01031740001 pacid=17837851 polypeptide=GSVIVT01031740001 locus=GSVIVG01031740001 ID=GSVIVT01031740001.Genoscope12X annot-version=Genoscope.12X ATGGGTATTACTACTTCCCTCTCATATCTTTTATTCTTCAACATCATCCTCCCAACCTTAACGGCTTCTCCAATACTGTT TCAGGGGTTCAATTGGGAATCATCCAAAAAGCAAGGAGGGTGGTACAACTTCCTCATCAACTCCATTCCTGAACTATCTG CCTCTGGAATCACTCATGTTTGGCTTCCTCCACCCTCTCAGTCTGCTGCATCTGAAGGGTACCTGCCAGGAAGGCTTTAT GATCTCAATGCATCCCACTATGGTACCCAATATGAACTAAAAGCATTGATAAAGGCATTTCGCAGCAATGGGATCCAGTG CATAGCAGACATAGTTATAAACCACAGGACTGCTGAGAAGAAAGATTCAAGAGGAATATGGGCCATCTTTGAAGGAGGAA CCCCAGATGATCGCCTTGACTGGGGTCCATCTTTTATCTGCAGTGATGACACTCTTTTTTCTGATGGCACAGGAAATCCT GATACTGGAGCAGGCTTCGATCCTGCTCCAGACATTGATCATGTAAACCCCCGGGTCCAGCGAGAGCTATCAGATTGGAT GAATTGGTTAAAGATTGAAATAGGCTTTGCTGGATGGCGATTCGATTTTGCTAGAGGATACTCCCCAGATTTTACCAAGT TGTATATGGAAAACACTTCGCCAAACTTTGCAGTAGGGGAAATATGGAATTCTCTTTCTTATGGAAATGACAGTAAGCCA AACTACAACCAAGATGCTCATCGGCGTGAGCTTGTGGACTGGGTGAAAGCTGCTGGAGGAGCAGTGACTGCATTTGATTT TACAACCAAAGGGATACTCCAAGCTGCAGTGGAAGGGGAATTGTGGAGGCTGAAGGACTCAAATGGAGGGCCTCCAGGAA TGATTGGCTTAATGCCTGAAAATGCTGTGACTTTCATAGATAATCATGACACAGGTTCTACACAAAAAATTTGGCCATTC CCATCAGACAAAGTCATGCAGGGATATGTTTATATCCTCACTCATCCTGGGATTCCATCCATATTCTATGACCACTTCTT TGACTGGGGTCTGAAGGAGGAGATTTCTAAGCTGATCAGTATCAGGACCAGGAACGGGATCAAACCCAACAGTGTGGTGC GTATTCTGGCATCTGACCCAGATCTTTATGTAGCTGCCATAGATGAGAAAATCATTGCTAAGATTGGACCAAGGTATGAT GTTGGGAACCTTGTACCTTCAACCTTCAAACTTGCCACCTCTGGCAACAATTATGCTGTGTGGGAGAAACAGTAA
>GSVIVT01031741001 pacid=17837852 polypeptide=GSVIVT01031741001 locus=GSVIVG01031741001 ID=GSVIVT01031741001.Genoscope12X annot-version=Genoscope.12X ATGTCCAAATTAACTTATTTATTATCTCGGTACATGCCAGGAAGGCTTTATGATCTGAATGCATCCAAATATGGCACCCA AGATGAACTGAAAACACTGATAAAGGTGTTTCACAGCAAGGGGGTCCAGTGCATAGCAGACATAGTTATAAACCACAGAA CTGCAGAGAAGCAAGACGCAAGAGGAATATGGCCATCTTTGAAGGAGGAACCCCAGATGATCGCCTTGACTGGACCCCAT CTTTCCTTTGCAAGGACGACACTCCTTATTCCGACGGCACCGGAAACCCTGATTCTGGAGATGACTACAGTGCCGCACCA GACATCGACCACATCAACCCACGGGTTCAGCAAGAGCTAA
What I’m trying to do is get the genome (ACGT) sequence for GSVIVT01031740001 (the middle sequence), and none of the others. My current regex is
sequence = re.compile('(?<=>GSVIVT01031740001) pacid=.*annot-version=.*\n[ACGT\n]*[^(?<!>GSVIVT01031740001) pacid]')
with my logic being find the header with the genbank ID for the correct organism, give me that line, then go to a new line and give me all ACGT and new lines until I get to a header for an organism with a different genbank ID. This fails to give any results.
Yes, I know that re.compile doesn’t actually perform a search; I’m searching against a file opened as ‘target’ so my execution looks like
>>> for nucl in target:
...     if re.search(sequence, nucl):
...         print(nucl)
Can someone tell me what I’m doing wrong, either in my regex or by using regex in the first place? When I try this on regex101.com, it works, but when I try it in the Python interpreter (2.7.1), it fails.
Thanks!
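A likely cause of the difference from regex101: iterating over the file hands re.search one line at a time, so a pattern that spans the header plus several sequence lines never sees enough text to match. A sketch that avoids regex entirely, in Python 3, assuming the usual FASTA layout in which the record ID is the first word after '>' (the function name is illustrative):

```python
def sequence_for(lines, wanted_id):
    """Return the joined sequence of the record whose ID is wanted_id, or None."""
    collecting = False
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if collecting:
                break  # reached the header of the next record
            fields = line[1:].split()
            collecting = bool(fields) and fields[0] == wanted_id
        elif collecting:
            chunks.append(line)
    return "".join(chunks) if collecting else None
```

Called as sequence_for(target, "GSVIVT01031740001"), this reads the file once and stops as soon as the wanted record ends.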
Source: (StackOverflow)
I need both my alignment files to be in both bowtie and samtools format so that I can feed them into different programs later on in my pipeline. Is there any method I can use to convert a sam alignment file into a bowtie alignment file and vice versa?
An alternative would be to do the alignment twice and get the bowtie program to output it in different formats in each case. However, this wastes too much time.
Source: (StackOverflow)
I've been searching for a couple of hours (actually two days already) but I haven't found an answer to my problem yet. I've tried sed and awk but I can't get the parameters right.
Essentially, this is what I'm looking for
FOR every line in file_1
    IF [value in column 2 in file_1]
        IS EQUAL TO [value in column 4 in some row in file_2]
        OR IS EQUAL TO [value in column 5 in some row in file_2]
        OR IS BETWEEN [value in column 4 and value in column 5 in some row in file_2]
    THEN
        ADD column 3, 6 and 7 of that row of file_2 to column 3, 4 and 5 of file_1
NB: The values that need to be compared are INTs; the values in columns 3, 6 and 7 (which only need to be copied) are STRINGs.
And this is the context, but probably not necessary to read:
I have two files with genome data that I want to merge in a specific way (the columns are tab separated).
- The first file contains variants (only SNPs, for those interested), of which effectively only the second column is relevant. This column is a list of numbers (the position of each variant on the chromosome).
- The second is a structural annotation file that contains the following data:
  - Column 4 holds the begin position of a specific structure and column 5 the end position.
  - Columns 3, 7 and 9 contain information that describes the specific structure (name of a gene etc.)
I would like to annotate the variants in the first file with the data in the annotation file. Therefore, if the number in column 2 of the variants file is equal to column 4 or 5, or lies between those values, in a specific row, columns 3, 7 and 9 of that row in the annotation file need to be added.
Sample File 1
SOME_NON_RELEVANT_STRING 142
SOME_NON_RELEVANT_STRING 182
SOME_NON_RELEVANT_STRING 320
SOME_NON_RELEVANT_STRING 321
SOME_NON_RELEVANT_STRING 322
SOME_NON_RELEVANT_STRING 471
SOME_NON_RELEVANT_STRING 488
SOME_NON_RELEVANT_STRING 497
SOME_NON_RELEVANT_STRING 541
SOME_NON_RELEVANT_STRING 545
SOME_NON_RELEVANT_STRING 548
SOME_NON_RELEVANT_STRING 4105
SOME_NON_RELEVANT_STRING 15879
SOME_NON_RELEVANT_STRING 26534
SOME_NON_RELEVANT_STRING 30000
SOME_NON_RELEVANT_STRING 30001
SOME_NON_RELEVANT_STRING 40001
SOME_NON_RELEVANT_STRING 44752
SOME_NON_RELEVANT_STRING 50587
SOME_NON_RELEVANT_STRING 87512
SOME_NON_RELEVANT_STRING 96541
SOME_NON_RELEVANT_STRING 99541
SOME_NON_RELEVANT_STRING 99871
Sample File 2
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A1 0 38 B1 C1
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A2 40 2100 B2 C2
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A3 2101 9999 B3 C3
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A4 10000 15000 B4 C4
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A5 15001 30000 B5 C5
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A6 30001 40000 B6 C6
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A7 40001 50001 B7 C7
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A8 50001 50587 B8 C8
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A9 50588 83054 B9 C9
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A10 83055 98421 B10 C10
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A11 98422 99999 B11 C11
Sample output file
142 A2 B2 C2
182 A2 B2 C2
320 A2 B2 C2
321 A2 B2 C2
322 A2 B2 C2
471 A2 B2 C2
488 A2 B2 C2
497 A2 B2 C2
541 A2 B2 C2
545 A2 B2 C2
548 A2 B2 C2
4105 A3 B3 C3
15879 A5 B5 C5
26534 A5 B5 C5
30000 A5 B5 C5
30001 A6 B6 C6
40001 A7 B7 C7
44752 A7 B7 C7
50587 A8 B8 C8
87512 A10 B10 C10
96541 A10 B10 C10
99541 A11 B11 C11
99871 A11 B11 C11
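The lookup described above can be sketched in Python 3 with a binary search over the annotation start positions. This sketch follows the column layout of the sample files (label columns 3, 6 and 7, begin in column 4, end in column 5), assumes tab-separated input, and assumes the annotation intervals are at most touching rather than nested. The function names are illustrative.

```python
import bisect


def load_annotations(lines):
    """Return (starts, records); records are (start, end, col3, col6, col7), sorted by start."""
    records = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) >= 7:
            records.append((int(f[3]), int(f[4]), f[2], f[5], f[6]))
    records.sort()
    return [r[0] for r in records], records


def annotate(position, starts, records):
    """Return the labels of the interval containing position, or None."""
    # Rightmost interval whose start is <= position.
    i = bisect.bisect_right(starts, position) - 1
    if i >= 0 and records[i][1] >= position:  # end is inclusive
        return records[i][2:]
    return None
```

For each variant line, the position in column 2 is passed to annotate, and a non-None result is appended to the output line. Positions that fall in a gap between intervals return None.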
Source: (StackOverflow)
I have a string such as: abcgdfabc
I want to do the following:
input: a string, e.g.:
abcgdfabc
output: a dict (the key is the "word" and the value is the number of times it appears),
abc:2
gdf:1
Each "word" should have the maximum possible length; it should be a greedy match.
I have spent a lot of time on this and can't figure it out.
The string is longer than 5000 characters. It's a genome, and we want to find relationships within it; as a first step we need such a dict to make the data clearer. Please help.
Source: (StackOverflow)
I would like to find the exact same genomic intervals shared between samples (NE_id).
My Input:
chr start_call end_call NE_id
chr1 150 200 NE01
chr1 150 200 NE02
chr2 100 150 NE01
chr2 100 160 NE02
chr3 200 300 NE01
chr3 200 300 NE02
My expected output:
chr start_call end_call NE_id
chr1 150 200 NE01, NE02
chr3 200 300 NE01, NE02
In this example the chr2 genomic intervals overlap, but they do not correspond to the exact same genomic interval (size difference == 10).
Thank you very much.
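Since only exact matches count, this can be treated as a grouping problem: key each row by (chr, start_call, end_call) and keep the keys seen in more than one sample. A minimal Python 3 sketch, assuming whitespace-delimited input with the header shown above (the function name is illustrative):

```python
from collections import defaultdict


def shared_intervals(lines):
    """Return [((chr, start, end), [sample ids])] for intervals seen in 2+ samples."""
    samples = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header (its start column is not a number).
        if len(fields) < 4 or not fields[1].isdigit():
            continue
        chrom, start, end, sample = fields[:4]
        key = (chrom, int(start), int(end))
        if sample not in samples[key]:
            samples[key].append(sample)
    return [(key, ids) for key, ids in samples.items() if len(ids) > 1]
```

Each result can then be printed as chr, start_call, end_call and ", ".join(ids) to reproduce the expected output layout.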
Source: (StackOverflow)