WO2015194655A1 - METHOD, COMPUTER SYSTEM AND SOFTWARE FOR SELECTING Tag SNP, AND DNA MICROARRAY EQUIPPED WITH NUCLEIC ACID PROBE CORRESPONDING TO Tag SNP SELECTED BY SAID SELECTION METHOD - Google Patents

Publication number
WO2015194655A1
Authority
WO
WIPO (PCT)
Prior art keywords
snp
tagsnp
information
tag
snps
Prior art date
Application number
PCT/JP2015/067686
Other languages
French (fr)
Japanese (ja)
Inventor
正朗 長崎
要 小島
直樹 成相
隆広 三森
洋介 河合
Original Assignee
国立大学法人東北大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from JP2014223834A (JP6432974B2)
Application filed by 国立大学法人東北大学
Priority to US15/320,438 (US20170147745A1)
Publication of WO2015194655A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20 Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12M APPARATUS FOR ENZYMOLOGY OR MICROBIOLOGY; APPARATUS FOR CULTURING MICROORGANISMS FOR PRODUCING BIOMASS, FOR GROWING CELLS OR FOR OBTAINING FERMENTATION OR METABOLIC PRODUCTS, i.e. BIOREACTORS OR FERMENTERS
    • C12M1/00 Apparatus for enzymology or microbiology
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N37/00 Details not covered by any other group of this subclass
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Definitions

  • The present invention relates to the field of nucleic-acid-based genetic analysis. More specifically, it provides means for deriving, with better accuracy and from less SNP information, the overall single nucleotide polymorphism (SNP) information of an individual human genome.
  • SNPs: single nucleotide polymorphisms
  • Custom-made (tailor-made) medicine means medical care that applies treatment methods suited to the constitution of the individual patient, rather than applying the same treatment uniformly to everyone.
  • An essential element in knowing the constitution of each individual patient is that individual's genetic information, and the relationships between genetic information, constitution and disease are now being clarified through the decoding of the human genome. In this situation, the SNP is one of the human genetic elements currently attracting attention.
  • SNP is an abbreviation for single nucleotide polymorphism, which means a single-base difference between individuals.
  • SNPs are the most numerous kind of genetic variation, and it is estimated that there are 30 million or more SNPs in the human genome. SNPs are recognized as one of the most important factors when considering individual differences between humans. SNPs have been analyzed for their relationship with disease and constitution, with the efficacy of drugs, and so on, and many results have been achieved.
  • DNA microarrays, which are used as a comprehensive method of gene analysis, have been applied to SNP analysis in the human genome.
  • The SNP nucleic acid probe (hereinafter also referred to as “nucleic acid probe”) substantially includes a nucleotide sequence fragment of the human genome containing the SNP base, or its complementary strand.
  • There are more than 30 million SNPs alone, and mounting the corresponding nucleic acid probes comprehensively on DNA microarrays for widespread use in SNP detection is difficult both technically and in terms of cost.
  • Attempts have therefore been made to limit the nucleic acid probes mounted on the DNA microarray to those related to human constitution or disease, and to narrow down the probes to be mounted by a process called imputation.
  • This attempt focuses on the fact that SNPs in the genome are correlated with each other. Highly correlated SNPs are concentrated in limited regions (haplotype blocks). Imputation is a technique for narrowing down the SNPs to be mounted on a DNA microarray, based on the idea that if an appropriate SNP (TagSNP) is selected from a haplotype block, the genotypes of the SNPs that correlate strongly with that TagSNP (TargetSNPs) can be estimated with high probability without experimental typing.
  • The above-mentioned prior art document 1 discloses an attempt to select, from TagSNP candidates, TagSNPs that are linked with appropriately high probability, using their relationship with the TargetSNPs.
  • However, the number of nucleic acid probes mounted on a DNA microarray for SNP detection with high estimation accuracy exceeds 1 million, and the cost is high.
  • If the number is smaller than that, the estimation accuracy is lowered, which is a problem for predicting diseases and the like on the basis of accurate SNP information.
  • The present inventors examined the use of “mutual information”, which is used for predicting RNA secondary structure and for image alignment in medical image processing, as an index for selecting appropriate TagSNPs. Surprisingly, they found that even when the number of nucleic acid probes corresponding to TagSNPs used on a DNA microarray for SNP detection is greatly reduced, imputation based on the results obtained with that microarray maintains an accuracy equal to or higher than that of existing commercial DNA microarrays and the like, and thereby completed the present invention.
  • “SNP” in the present invention is an abbreviation for single nucleotide polymorphism and, like “nucleic acid probe”, covers both the singular and the plural.
  • The “group” in “SNP group” or “nucleic acid probe group” conceptually means the presence of a large number of SNPs or nucleic acid probes but, strictly speaking, means the presence of a plurality, that is, two or more, SNPs or nucleic acid probes.
  • The “nucleic acid probe corresponding to a TagSNP” is a nucleic acid probe for identifying that SNP, and is specifically disclosed in the section “array of the present invention” in item (3) of the embodiments of the invention.
  • the present invention provides the following inventions.
  • The present invention is a method for selecting the TagSNPs that constitute the nucleic acid probe groups corresponding to TagSNPs used as means for imputing SNP information of a human genome, using human genome information that includes information on SNP groups in which the genotypes of a plurality of individuals are specified, the method comprising: a) using the SNP group in the human genome information as a population, treating each SNP as a TagSNP candidate, taking the SNPs present in the vicinity defined within a certain range from its locus as TargetSNPs, and calculating the sum of the mutual information between the TagSNP candidate and these TargetSNPs; and b) selecting, from all TagSNP candidates, those having a large sum of mutual information, in descending order of the sum, as the TagSNPs to be represented in the nucleic acid probes used as the means for imputation.
  • the present invention provides a method for selecting TagSNP (hereinafter also referred to as the selection method of the present invention).
  • the present invention provides a DNA microarray (also referred to as an array of the present invention), characterized in that a nucleic acid probe corresponding to a TagSNP selected according to the selection method of the present invention is mounted.
  • the array of the present invention can be produced by a DNA microarray production method (hereinafter also referred to as the array production method of the present invention), which comprises the following steps (1) and (2). (1) a first step of selecting a TagSNP according to the selection method of the present invention; (2) A second step of mounting a nucleic acid probe for detecting the genotype of the TagSNP in the human genome in the sample on the DNA microarray based on the TagSNP selected in the first step.
  • The present invention provides the following computer system (hereinafter also referred to as the computer system of the present invention). That is, the computer system of the present invention is a computer system for selecting the TagSNPs that constitute the nucleic acid probe groups corresponding to TagSNPs used as means for imputing SNP information of a human genome, using human genome information that includes information on SNP groups in which the genotypes of a plurality of individuals are specified, the computer system comprising a recording unit and an arithmetic processing unit; (A) in the recording unit, information on the TagSNP candidates read from the human genome information, and information on the SNPs present in the vicinity defined within a certain range from the loci of those TagSNP candidates, are used as TargetSNP information.
  • The arithmetic processing unit calculates, based on the information (1) to (4) of (A) from the recording unit, the sum of the mutual information between each TagSNP candidate and its corresponding TargetSNPs.
  • The present invention provides the following computer program (hereinafter also referred to as the program of the present invention). That is, the program of the present invention is a computer program for selecting the TagSNPs that constitute the nucleic acid probe groups corresponding to TagSNPs used as means for imputing SNP information of a human genome, using human genome information that includes information on SNP groups in which the genotypes of a plurality of individuals are specified, the program causing a computer to realize: (A) a first function of reading, for processing in the arithmetic processing unit, the following information (1) to (4) from the recording unit, in which information on the TagSNP candidates read from the human genome information and, as TargetSNP information, information on the SNPs present in the vicinity defined within a certain range from the loci of those TagSNP candidates are recorded: (1) the locus of each TagSNP candidate on the human genome, (2) the genotype of each TagSNP candidate in the individual human genome information, (3) the locus of each TargetSNP on the human genome, and (4) the genotype of each TargetSNP in the individual human genome information; and (B) a second function of calculating, based on the information (1) to (4) read out by the first function, the sum of the mutual information between each TagSNP candidate and its corresponding TargetSNPs.
  • a TagSNP candidate is selected as a second TagSNP, and then the steps (B) and (C) are repeated, this repetition being carried out for the selection of the Mth TagSNP (M is a natural number);
  • a third function in which the repetition continues until the natural number M reaches the predetermined number of nucleic acid probes, corresponding to the selected TagSNPs, to be used as the means for imputation;
  • a computer program characterized by including an algorithm for realizing the above functions.
  • the present invention further provides a computer-readable recording medium (hereinafter also referred to as the recording medium of the present invention), in which the program of the present invention is recorded.
  • a typical computer system of the present invention is characterized by executing the program of the present invention.
  • From the viewpoint of selection efficiency, it is preferable that the TargetSNP group used to calculate the sum of mutual information for each TagSNP candidate be a TargetSNP group that has been narrowed down in advance by an index other than mutual information. From the same point of view, it is preferable that the program of the present invention include, at a stage before the algorithm that realizes the second function, an algorithm that narrows down in advance, by an index other than mutual information, the TargetSNP group on which the second function operates.
  • The “index other than mutual information” is a linkage disequilibrium value between a TagSNP candidate and a TargetSNP present in the vicinity, within a certain range, of the TagSNP locus; typical examples are the r² linkage disequilibrium value and the D linkage disequilibrium value.
  • Of the “indexes other than mutual information” described above, it is preferable to use the r² linkage disequilibrium value.
  • The threshold for this value is preferably in the range of 0.70 to 0.85. If the threshold exceeds 0.85, the prior narrowing down becomes too strict, and there is a greater risk of excluding TagSNP candidates that would otherwise be preferred. If it is less than 0.70, the number of TargetSNPs that survive the prior narrowing down and enter the calculation of the sum of mutual information increases excessively, and the selection process tends to become inefficient.
  • The “vicinity defined within a certain range” from the TagSNP candidate locus is preferably within 500 kbp upstream and downstream of the TagSNP locus, more preferably 100 to 500 kbp.
  • The “number of TagSNPs to be selected”, that is, the number of TagSNPs selected for the nucleic acid probes used as means for imputation, is required to be no less than the number at which the result of imputation satisfies a predetermined performance.
  • The concrete index of this “predetermined performance” is not particularly limited, but is preferably an index that reflects as objectively as possible the imputation performance of the means that uses the TagSNP information.
  • The average of the squares of the correlation coefficient between the experimentally typed genotype and the genotype estimated by imputation at each locus is 0.94 or more, preferably 0.95 or more, more preferably 0.96 or more. Below this value, the correlation between the result of imputation based on the typing results of the selected TagSNPs and the actual genotypes cannot be said to be superior to that of conventional products, and it becomes difficult to fully demonstrate the usefulness expected of the present invention relative to conventional products.
  • For SNPs with a minor allele frequency (MAF) of 3 to 5%, the average of the squares of the correlation coefficient between the imputed genotype and the actual genotype is 0.82 or more, preferably 0.84 or more, more preferably 0.87 or more.
  • For SNPs with a MAF of 1 to 3%, the average of the squares of the correlation coefficient between the imputed genotype and the actual genotype is 0.73 or more, preferably 0.75 or more, more preferably 0.79 or more. These indices can also be used.
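The accuracy indices above can be illustrated with a short calculation. The following is only a minimal sketch: it assumes genotypes are coded as 0/1/2 counts of the alternate allele, that the MAF bin is half-open (lower bound included, upper bound excluded), and that the helper names `maf`, `squared_correlation` and `mean_r2_in_maf_bin` are hypothetical and do not come from the specification.

```python
def maf(dosages):
    """Minor allele frequency from 0/1/2 alternate-allele counts (assumed coding)."""
    alt_freq = sum(dosages) / (2 * len(dosages))
    return min(alt_freq, 1 - alt_freq)


def squared_correlation(x, y):
    """Square of Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic vector carries no correlation information
    return (sxy * sxy) / (sxx * syy)


def mean_r2_in_maf_bin(typed, imputed, maf_low, maf_high):
    """Average r^2 over loci whose MAF (from typed genotypes) lies in [maf_low, maf_high)."""
    values = [
        squared_correlation(typed[snp], imputed[snp])
        for snp in typed
        if maf_low <= maf(typed[snp]) < maf_high
    ]
    return sum(values) / len(values) if values else None


# Toy data: three loci typed by experiment and the same loci as imputed.
typed = {"s1": [0, 0, 0, 1], "s2": [0, 1, 1, 2], "s3": [0, 0, 1, 0]}
imputed = {"s1": [0, 0, 0, 1], "s2": [0, 1, 1, 2], "s3": [0, 1, 0, 0]}
print(mean_r2_in_maf_bin(typed, imputed, 0.10, 0.20))
```

With real data the bins would be the 1 to 3% and 3 to 5% MAF ranges named above, and the resulting averages would be compared against the stated thresholds.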
  • Although the upper limit of the number is not particularly limited, it is 1 million or less at the time of completion of the present invention, and more than 700,000 is preferable from the viewpoints of both economy, which depends on the number used, and the certainty of the predictions for the SNPs.
  • A specific lower limit of the number is about 300,000. As shown in the examples described later, it has been found that even with 300,000, excellent imputation exceeding the basic level based on the MAF can be performed. Preferably about 400,000 or more, more preferably about 500,000 or more, and very preferably about 600,000 or more are assumed. Depending on the performance expected of the array of the present invention, an appropriate number can be selected by referring to the MAF-based indices described above or the like. In Japanese Patent Application No. 2014-223834, up to 675,000 TagSNPs in Japanese individuals were actually identified and disclosed in the specification.
  • The word “about” attached to SNP counts such as “about 300,000” or “about 400,000” has its ordinary meaning. In particular, even when a specific number of TagSNPs such as “300,000” is stated, the imputation performance does not change substantially within a certain margin around that number. Specifically, if the difference is within 1% of the specific TagSNP number, or more strictly within 0.5%, there is no substantial difference in imputation performance. This value serves as a guide when some SNPs are removed from a TagSNP group once it has been selected. Further, when an SNP removed from the selected TagSNPs was not actually contributing to the imputation, its removal affects imputation performance even less.
  • Among the TagSNP groups selected according to the selection method of the present invention, some TagSNPs, when actually converted into the corresponding nucleic acid probes and mounted on a DNA microarray, are not actually detected as SNPs in the population to which the present invention is applied.
  • A small number of TagSNPs that do not exhibit appropriate imputation performance are also found. This mainly becomes clear through after-the-fact verification, and such non-functioning SNPs can be removed from the TagSNP group before further use. Since the number of SNPs removed in this way is relatively small (about 0.1% at most), such removal stays well within the range in which “the imputation performance does not change substantially”. In other words, when a specific number of TagSNPs is selected according to the selection method of the present invention, the number of SNPs that will later be removed can be anticipated from the ratio (%) described above.
  • The human genome information used in carrying out the selection method and computer system of the present invention can be based on information in a human genome database, for example the database covering all human populations of the international 1000 Genomes Project.
  • The accuracy of SNP estimation based on TagSNPs tends to improve when human genome information restricted to a narrower population category is used.
  • Asian Mongoloid: more specifically, Japanese, Chinese, Malay, Polynesian, Micronesian, etc.
  • Caucasian: more specifically, Italian, English, Egyptian, Indian, Lap, etc.
  • Amerind: more specifically, Eskimo, Brazilian Indian, Alaska Indian, etc.
  • Negroid: more specifically, Nigerian, Bantu, Bushua, etc.
  • Australoid: more specifically, Australian natives, Papua New Guineans, etc.
  • the existence of specific human genome information is a prerequisite.
  • the usefulness of the present invention was verified by performing verification based on a human genome database of 1070 Japanese people from “Tohoku University Tohoku Medical Megabank Organization (ToMMo)”.
  • The genotypes detected by the nucleic acid probe group corresponding to the TagSNPs selected in the present invention are suitable for use in imputing the SNP information of the human genome, as described above.
  • This “means for detecting the genotype detected by the nucleic acid probe group corresponding to Tag SNP” is not particularly limited as long as it can detect the genotype of SNP, and is currently provided.
  • a specific method for producing an array of the present invention using a nucleic acid probe capable of detecting a base polymorphism in the TagSNP base can be performed according to a known method for producing a DNA microarray at the time of the present invention, It is also possible to apply a DNA microarray production method provided in the future.
  • In the selection method of the present invention, it is possible to select one or more other SNPs separately from the TagSNP selection by the selection method of the present invention, and to include them with priority alongside the TagSNPs. It is also possible to mount nucleic acid probe groups corresponding to these other SNPs on the array of the present invention.
  • Likewise, one or more other SNPs can be selected separately from the TagSNP selection by the selection method of the present invention, and these other SNPs can be included with priority as SNPs to be selected.
  • In the program of the present invention, it is possible to provide an algorithm by which, separately from the TagSNP selection by the selection method of the present invention, one or more other SNPs are selected and identified with priority as SNPs to be selected.
  • Hereinafter, “other SNPs” means the “one or more other SNPs” mentioned above.
  • the duplication between the other SNPs and the Tag SNP selected by the selection method of the present invention is removed.
  • the method for removing one of the duplicated SNPs is not particularly limited.
  • the SNPs to be preferentially removed from the SNP population used when selecting the Tag SNPs may be removed in advance.
  • As other SNPs, practically useful SNPs that are difficult to select by the selection method of the present invention are preferable.
  • an object such as further characterizing the DNA array can be achieved.
  • SNPs that are candidates for use as “other SNPs” include (a) SNPs that are in weak linkage disequilibrium with the TagSNPs and whose genotypes are difficult to estimate with sufficient accuracy by imputation, (b) Y chromosome and mitochondrial SNPs, (c) SNPs reported in previous studies to be related to diseases, (d) SNPs in the HLA region, and (e) SNPs reported to be related to drug metabolism. These are described more specifically as follows.
  • SNPs in the HLA region: the HLA region is a region for which associations with many diseases have been reported, so it is practically suitable to select the other SNPs in this category independently of their r² linkage disequilibrium values with the TagSNPs.
  • The number of TagSNPs used in a means for imputation can be greatly reduced, and the imputation performance based on the results obtained with that means can be kept equal to or better than that of existing commercial DNA microarrays.
  • a means for imputation such as a DNA microarray for SNP detection
  • The present invention makes it possible to select nucleic acid probes for SNP detection on the basis of the above significant reduction in the number of TagSNPs and excellent imputation performance, so that means for imputation can be offered at low cost.
  • One of the objects of the present invention is, as described above, to select a TagSNP group that greatly reduces the number of TagSNPs corresponding to the nucleic acid probes mounted on the array used as the means for performing imputation, such as a DNA microarray for SNP detection, while maintaining an imputation accuracy, based on the results obtained by that means, equal to or better than that of existing commercial DNA microarrays and the like, and to prepare a DNA microarray carrying the corresponding nucleic acid probes.
  • This object can be achieved by the selection method of the present invention described above.
  • the selection method of the present invention can be performed by executing the program of the present invention in the computer system of the present invention.
  • The SNP group can be specified by applying well-known statistical processing to the base sequences of a plurality of human genomes obtained with a next-generation sequencer (NGS) or the like.
  • NGS next-generation sequencer
  • the frequency of the genotypes of TagSNP and TargetSNP is calculated from “genotype”.
  • the frequency can be obtained by a conventional method.
  • If the haplotypes of the SNP group are specified, the linkage disequilibrium values and mutual information of the SNP group can be calculated more precisely, which is preferable.
  • In that case, the genotype frequency described above is replaced with the frequency of the alleles constituting the genotype, and the frequency of genotype combinations between two SNPs is replaced with the frequency of the specified haplotype.
  • “Phasing” is known as a means of specifying haplotypes.
  • Phasing methods are roughly divided into the following two methods.
  • A. Methods utilizing linkage disequilibrium between segregating (polymorphic) loci (SHAPEIT2: Delaneau et al., Improved whole chromosome phasing for disease and population genetic studies, Nature Methods, 2013; MaCH: Li et al., MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes, Genetic Epidemiology, 2010)
  • This is a statistical phasing method that usually uses genotype data from a population of 1000 or more. It is highly accurate at loci whose variants have a high allele frequency (5% or more), but at loci with low allele frequency the accuracy tends to be low because of insufficient data, and obtaining high accuracy requires the genotypes of a very large sample population.
  • Using the SNP group in the human genome database as a population, the SNPs present in the vicinity within a certain range from the locus of each SNP that is a TagSNP candidate are set as TargetSNPs, and the sum of the mutual information between the TagSNP candidate and these TargetSNPs is calculated.
  • Mutual information is the quantity defined by the following equation when two random variables x and y follow the probability distributions p(x) and p(y), and the joint probability of x and y follows p(x, y): $I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$ (the standard definition of mutual information).
  • Here, x and y are the genotypes of two different SNPs, and p(x) and p(y) correspond to their frequencies.
  • p(x, y) is the frequency with which these genotypes are observed simultaneously at the two SNPs.
  • the “mutual information amount between Tag SNP candidate and Target SNP” can be calculated.
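As an illustration of this calculation, the sketch below estimates the mutual information between two SNPs directly from the genotypes observed in the same individuals, using observed frequencies for p(x), p(y) and p(x, y). It is a minimal example under stated assumptions: the function name `mutual_information` is hypothetical, genotypes are represented by arbitrary hashable labels (here 0/1/2), and the logarithm is taken to base 2, which the text does not specify.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two genotype vectors.

    x[i] and y[i] are the genotypes of the same individual i at two SNPs.
    """
    if len(x) != len(y) or not x:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    n = len(x)
    count_x = Counter(x)           # marginal counts for the first SNP
    count_y = Counter(y)           # marginal counts for the second SNP
    count_xy = Counter(zip(x, y))  # joint counts of genotype pairs
    total = 0.0
    for (a, b), c in count_xy.items():
        p_ab = c / n
        p_a = count_x[a] / n
        p_b = count_y[b] / n
        total += p_ab * log2(p_ab / (p_a * p_b))
    return total


# A TagSNP candidate and a TargetSNP with identical genotypes share maximal information.
tag = [0, 1, 2, 0, 1, 2]
target = [0, 1, 2, 0, 1, 2]
print(mutual_information(tag, target))
```

Only genotype pairs that are actually observed contribute to the sum, which avoids evaluating the logarithm at zero.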
  • For each TagSNP candidate, it is also necessary to calculate the frequency with which its genotype and the genotype of each TargetSNP present in the vicinity defined within a certain range from its locus are observed simultaneously.
  • If the haplotypes of the SNP group are specified, the genotype frequency may be replaced with the frequency of the alleles constituting the genotype, and the frequency with which genotypes are observed simultaneously at the two SNPs may be replaced with the haplotype frequency.
  • Next, TagSNP candidates having a large sum of mutual information are selected from all TagSNP candidates, in descending order of the sum, as the TagSNPs to be represented in the nucleic acid probes used as the means for imputation.
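The ranking step can be sketched as follows. This is a simplified illustration, not the full procedure of the specification: it assumes all SNPs lie on one chromosome, scores every SNP as a TagSNP candidate against the other SNPs within a window, and simply takes the top M candidates, without any update between successive selections. The names `sum_mutual_information` and `select_top`, the window size and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information (bits) between two genotype vectors."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((cx[a] / n) * (cy[b] / n)))
        for (a, b), c in cxy.items()
    )


def sum_mutual_information(positions, genotypes, window_bp):
    """For each candidate, sum MI with all other SNPs within window_bp of its locus."""
    scores = {}
    for cand, cand_pos in positions.items():
        total = 0.0
        for target, target_pos in positions.items():
            if target != cand and abs(target_pos - cand_pos) <= window_bp:
                total += mutual_information(genotypes[cand], genotypes[target])
        scores[cand] = total
    return scores


def select_top(scores, m):
    """Return the m candidates with the largest sums, in descending order of the sum."""
    return sorted(scores, key=lambda snp: (-scores[snp], snp))[:m]


# Toy data: positions in base pairs on a single chromosome (assumed).
positions = {"rs1": 100, "rs2": 200, "rs3": 300, "rs4": 2_000_000}
genotypes = {
    "rs1": [0, 1, 2, 0, 1, 2],
    "rs2": [0, 1, 2, 0, 1, 2],
    "rs3": [0, 0, 1, 1, 2, 2],
    "rs4": [0, 0, 0, 1, 1, 1],
}
scores = sum_mutual_information(positions, genotypes, window_bp=500_000)
print(select_top(scores, 2))
```

The nested loop is quadratic in the number of SNPs; with genome-scale data, sorting SNPs by position and scanning only the window around each candidate would be the natural refinement.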
  • From the viewpoint of improving the efficiency of TagSNP selection, it is preferable that the TargetSNP group be narrowed down in advance by an index other than mutual information.
  • The “r² linkage disequilibrium value (R-squared value, or R^2)”, which is particularly suitable among these, is the square of Pearson's correlation coefficient for the genotype frequencies of two SNPs. It takes a value from 0 to 1, and a value closer to 1 indicates stronger linkage disequilibrium.
  • If the haplotypes of the SNP group are specified, the genotype frequency may be replaced with the frequency of the alleles constituting the genotype, and the frequency with which genotypes are observed simultaneously at the two SNPs may be replaced with the haplotype frequency.
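When phased haplotypes are available, r² can be computed from allele and haplotype frequencies. The sketch below uses the widely used form r² = D² / (p_A (1 − p_A) p_B (1 − p_B)) with D = p_AB − p_A p_B. It assumes biallelic SNPs with alleles coded 0/1 and one (a, b) allele pair per phased chromosome; the function name `r2_from_haplotypes` is hypothetical.

```python
def r2_from_haplotypes(haplotypes):
    """r^2 linkage disequilibrium from phased two-SNP haplotypes.

    haplotypes: list of (a, b) pairs, where a and b are the 0/1 alleles carried
    by one chromosome at the first and second SNP respectively.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # haplotype 1-1
    d = p_ab - p_a * p_b                      # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic in this sample
    return d * d / denom


# Two SNPs whose alleles always travel together are in complete linkage disequilibrium.
print(r2_from_haplotypes([(1, 1), (1, 1), (0, 0), (0, 0)]))
```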
  • The selection method of the present invention can be performed efficiently by pre-selecting a TargetSNP group whose linkage disequilibrium value, such as the r² linkage disequilibrium value, exceeds a certain threshold.
  • The threshold for selection by r² linkage disequilibrium value has been described above. The “vicinity defined within a certain range” and the “number of TagSNPs to be selected” have also been described above, as has the “addition of other SNPs”.
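The pre-selection can be sketched with unphased genotypes as follows. This is only an illustration under assumptions: genotypes are coded as 0/1/2 allele counts, r² is taken as the squared Pearson correlation between these counts, the threshold of 0.8 is one hypothetical value inside the 0.70 to 0.85 range given earlier, and the names `genotype_r2` and `prefilter_targets` are not from the specification.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two 0/1/2 genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP cannot be in measurable LD
    return (sxy * sxy) / (sxx * syy)


def prefilter_targets(candidate, targets, genotypes, threshold=0.8):
    """Keep only the TargetSNPs whose r^2 with the candidate reaches the threshold."""
    return [
        t for t in targets
        if genotype_r2(genotypes[candidate], genotypes[t]) >= threshold
    ]


genotypes = {
    "tag": [0, 1, 2, 0, 1, 2],
    "t1": [0, 1, 2, 0, 1, 2],   # identical to the candidate, r^2 = 1
    "t2": [0, 0, 1, 1, 2, 2],   # weakly correlated, r^2 = 0.25
    "t3": [2, 1, 0, 2, 1, 0],   # perfectly anti-correlated, r^2 = 1
}
print(prefilter_targets("tag", ["t1", "t2", "t3"], genotypes))
```

Only the TargetSNPs that pass this filter would then enter the sum of mutual information for the candidate, which is what reduces the cost of the main calculation.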
  • Computer system and computer program of the present invention: The computer system of the present invention is a system that serves as means for performing the above-described selection method of the present invention, and the program of the present invention is a computer program having an algorithm for performing the selection method of the present invention in the computer system of the present invention.
  • “Algorithm” means a formalized representation of a procedure for solving a problem, as in the general usage of the computer field.
  • The computer system of the present invention can include the hardware of an ordinary computer system. That is, in addition to a “recording unit” corresponding to an ordinary hard disk drive and an “arithmetic processing unit” corresponding to a CPU, it can be provided with, for example, a “temporary storage unit” corresponding to RAM, an “operation unit” corresponding to a keyboard, mouse, touch panel or the like, a “display unit” corresponding to a display, an “input/output interface (IF) unit” that has a serial or parallel interface for the operation unit together with a video memory and a D/A converter and outputs the corresponding analog signal, and a “communication interface (IF) unit”.
  • The communication IF unit can exchange data with external information sources, in particular sources of human genome information such as a human genome database.
  • When the “operation unit” is operated, the “arithmetic processing unit” acquires data, in particular from a human genome database, through the “communication IF unit” and records it in the “recording unit”. It then reads the data into the “temporary storage unit”, performs predetermined processing, and records the result in the “recording unit” again.
  • The “arithmetic processing unit” creates screen data for prompting operation of the “operation unit” and screen data for displaying the processing results, and displays these images on the “display unit” via the video RAM of the input/output IF unit.
  • The program of the present invention is installed or recorded in advance in the “recording unit”, or recorded on an external hardware resource, and is executed in the “arithmetic processing unit” as necessary according to the algorithm it describes.
  • FIG. 1 is a flowchart outlining the contents of the program of the present invention.
  • FIG. 2 is a flowchart expressing FIG. 1 more specifically. Step S1 is common to FIG. 1 and FIG. 2 and is the step of “reading the Target SNPs, the Tag SNP candidates, and the genotypes at those loci from an input file containing, for each SNP, its site (chromosome, position) and the genotype of each individual”.
  • In this example, the input file contains the variants found in a reference panel, namely the whole-genome data of 1,070 Japanese individuals determined by NGS (Next Generation Sequencer) at the Tohoku Medical Megabank Organization (ToMMo).
  • Step S1 describes the first function of the program of the present invention. That is, from the recording unit in which the following items are recorded for human genome information that includes the genotypes of multiple individuals:
  • (1) the locus on the human genome of each TagSNP candidate;
  • (2) the genotype of each TagSNP candidate in each individual's human genome information;
  • (3) the locus on the human genome of each TargetSNP;
  • (4) the genotype of each TargetSNP in each individual's human genome information,
  • step S1 describes a “first function” of reading items (1) to (4) for processing in the arithmetic processing unit.
  • A step of giving priority to “other SNPs” can be provided before step S1.
  • This pre-insertion step is preferably provided as an alternative to the post-insertion step described later.
  • Step S1′ shown in FIG. 2 represents the initial state of the Tag SNPs and Target SNPs that are selected thereafter.
  • Step S2 in FIG. 1 is a step of “calculating scores for all unselected Tag SNP candidates” from the human genome information read from the recording unit in step S1.
  • Step S2 describes the first half of the second function of the program of the present invention.
  • Steps S2-1 (1) / (2) and steps S2-3 (1) / (2) each form a pair of loop ends.
  • In step S2, based on items (1) to (4) read by the first function, the sum of the mutual information with the corresponding Target SNPs is calculated for each Tag SNP candidate, and a function that uses this sum as the score is described.
  • The mutual information is an information-theoretic quantity obtained by the numerical calculation described above. As a premise for the calculation, in addition to the genotype frequencies of each Tag SNP candidate, the frequencies of the genotype combinations of the Tag SNP candidate and each Target SNP present in the neighborhood defined within a certain range of the candidate's locus must also be calculated. These frequency calculations are preferably performed in step S2.
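The frequency counts and the mutual information mentioned above can be sketched as follows. This is only an illustration under stated assumptions: genotypes are given as one label per individual, frequencies are plain sample proportions, the logarithm is taken to base 2, and the name `mutual_information` is introduced here for the example.

```python
from collections import Counter
from math import log2


def mutual_information(tag_genotypes, target_genotypes):
    """Mutual information (in bits) between two genotype vectors.

    Assumption: both vectors list one genotype label per individual,
    in the same individual order.
    """
    n = len(tag_genotypes)
    if n == 0 or n != len(target_genotypes):
        raise ValueError("genotype vectors must be non-empty and equally long")
    # Genotype frequencies of each SNP and of their combinations.
    tag_counts = Counter(tag_genotypes)
    target_counts = Counter(target_genotypes)
    joint_counts = Counter(zip(tag_genotypes, target_genotypes))
    mi = 0.0
    for (a, b), c in joint_counts.items():
        # p(a, b) * log2(p(a, b) / (p(a) * p(b))), written with raw counts.
        mi += (c / n) * log2(c * n / (tag_counts[a] * target_counts[b]))
    return mi


# A Target SNP fully determined by the Tag SNP shares all of its information.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
# Independent genotypes share none.
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Only genotype combinations that actually occur are visited, so no special handling of zero frequencies is needed.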
  • The present example shows a preferred embodiment in which the Target SNPs used to calculate the mutual information for each Tag SNP candidate are narrowed down by a threshold that defines the lower limit of the r² linkage disequilibrium value (R^2).
  • The method for calculating the r² linkage disequilibrium value is as described above, as is the preferable range of the threshold; in the examples described later, “r² > 0.8” was used as the threshold.
  • Step S2-1 (1) shown in FIG. 2 is the opening end of the loop that selects the M TagSNP candidates “i” one at a time.
  • Step S2-3 (1) is the opening end of the loop that selects the N TargetSNPs “j” one at a time.
  • Step S2-4 is the step of determining whether or not to calculate a score.
  • One condition is that the distance (bp) is no more than “L0”; that is, “L0” is the distance that defines the predetermined range from the locus of the TagSNP candidate. The distance is as described above.
  • The threshold value is also as described above.
  • In step S2-4, if the condition in the decision box is “Yes”, the process proceeds to the next step S2-5; if “No”, it returns to step S2-3 (1).
  • Step S2-5 is the step of calculating a score when “Yes” is determined in step S2-4 and adding the value to the TagSNP candidate “i”.
  • The “score” is the mutual information between the TagSNP candidate “i” and the TargetSNP “j” that it covers as a pair.
  • Step S2-3 (2) is the closing end of the loop opened at step S2-3 (1), which selects the Target SNPs,
  • and step S2-1 (2) is the closing end of the loop opened at step S2-1 (1), which selects the Tag SNP candidates.
  • At each loop end, the pair of TagSNP candidate and TargetSNP under examination is updated.
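The double loop of steps S2-1 to S2-5 can be sketched as below. This is a hedged illustration, not the program of FIG. 2: all SNPs are assumed to lie on one chromosome, genotypes are coded 0/1/2, the defaults `l0=500_000` and `r0=0.8` merely echo values mentioned in the text, the comparison `r2 >= r0` is an assumption, and every function name is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (bits) between two genotype vectors."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (cx[a] * cy[b]))
               for (a, b), c in cxy.items())


def r2_ld(x, y):
    """Squared Pearson correlation of 0/1/2-coded genotypes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov * cov / (vx * vy)


def score_candidates(tag_pos, tag_geno, target_pos, target_geno,
                     l0=500_000, r0=0.8):
    """Return one score per TagSNP candidate (sum of mutual information)."""
    scores = [0.0] * len(tag_pos)
    for i in range(len(tag_pos)):              # loop S2-1: candidates i
        for j in range(len(target_pos)):       # loop S2-3: targets j
            # S2-4: decide whether this pair contributes to the score.
            if abs(tag_pos[i] - target_pos[j]) > l0:
                continue
            if r2_ld(tag_geno[i], target_geno[j]) < r0:
                continue
            # S2-5: add the pair's mutual information to candidate i.
            scores[i] += mutual_information(tag_geno[i], target_geno[j])
    return scores


# Toy data: candidate 0 is close to both targets, candidate 1 is far away.
scores = score_candidates(
    tag_pos=[100, 10_000_000],
    tag_geno=[[0, 0, 1, 1], [0, 1, 0, 1]],
    target_pos=[200, 300],
    target_geno=[[0, 0, 1, 1], [0, 1, 0, 1]],
)
print(scores)
```

For genome-scale input, sorting SNPs by position and scanning only the window around each candidate would avoid the full M × N loop; the nested form is kept here to mirror the flowchart.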
  • Step S3 shown in FIG. 1 is a step of “selecting a Tag SNP candidate having the maximum score calculated in step S2”.
  • Step S3 describes the second half of the second function of the program of the present invention.
  • Step S3-2 (1) / (2) is a set of loop ends.
  • In step S3-1, the index of the TagSNP candidate with the maximum score calculated in step S2 is set to “k”, and the corresponding element of the S array described above is set to “1” to mark it as a selected TagSNP.
  • In step S3-4, when the r² linkage disequilibrium value between the TagSNP “k” that currently has the maximum score and a TargetSNP “j” in the corresponding TargetSNP group is at least the threshold “R0”, that TargetSNP “j” is treated as covered.
  • Otherwise, at step S3-3 the flow returns to step S3-2 (1):
  • the TargetSNP “j” is not treated as covered, and the same check is performed for the next Target SNP.
  • Step S4 is common to FIGS. 1 and 2 and is the step of “determining whether the total number of selected Tag SNP candidates has reached the predetermined number”. In FIG. 2, this is written as the determination of whether the number mounted has reached “S0”.
  • Step S4 describes the third function of the program of the present invention. That is, based on the Tag SNP information and Target SNP information from which the information on the Tag SNP selected in steps S2 and S3 (which perform the second function) and on its corresponding Target SNP group has been removed, the Tag SNP candidate with the maximum sum of mutual information (in this example, as mentioned above, subject to the threshold on the r² linkage disequilibrium value) is again determined by steps S2 and S3 and selected as the second Tag SNP. Steps S2 and S3 are then repeated, and the third function describes performing this iteration until the “predetermined number of Tag SNPs used as means for imputation, for example in a DNA microarray for SNP detection” is reached.
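The iteration over steps S2 to S4 amounts to a greedy selection, which can be sketched as follows. This is an illustration under assumptions, not the program itself: `pair_mi[i]` is a hypothetical precomputed mapping from each TargetSNP index `j` to the mutual information of the pair (candidate `i`, target `j`), containing only pairs that passed the distance and r² checks; ties are broken by the lowest candidate index, which the source does not specify.

```python
def greedy_select(pair_mi, s0):
    """Greedily pick up to s0 TagSNP candidates.

    pair_mi[i] maps target index j -> mutual information for each
    (candidate i, target j) pair that passed the distance and r^2 checks
    (hypothetical input format for this sketch).
    """
    selected = []
    covered = set()                      # targets already covered
    remaining = list(range(len(pair_mi)))

    def score(i):
        # S2: sum of mutual information over targets not yet covered.
        return sum(mi for j, mi in pair_mi[i].items() if j not in covered)

    while remaining and len(selected) < s0:     # S4: stop at S0 tags
        best = max(remaining, key=score)        # S3: maximum-score candidate
        selected.append(best)
        remaining.remove(best)
        covered.update(pair_mi[best])           # its targets are now covered
    return selected


# Candidates 0 and 1 cover the same two targets; candidate 2 covers a third.
pair_mi = [{0: 1.5, 1: 1.5}, {0: 1.5, 1: 1.5}, {2: 1.0}]
print(greedy_select(pair_mi, 2))
```

Because covered targets stop contributing to later scores, a candidate that is redundant with an already selected Tag SNP (candidate 1 in the toy data) falls behind one that covers new Target SNPs.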
  • A step of giving priority to “other SNPs” can also be provided after step S4.
  • This post-insertion step is preferably provided as an alternative to the pre-insertion step described above.
  • The program of the present invention can be written in, for example, C, Java (registered trademark), Perl, or Python, and can be multi-platform.
  • The program of the present invention can be stored on a computer-readable recording medium or on a recording medium that can be connected to a computer, and such recording media are also provided as the storage medium of the present invention.
  • These recording media include, but are not limited to, magnetic media such as flexible disks, flash memories, and hard disks, optical media such as CD, DVD, and BD, and magneto-optical media such as MO and MD.
  • The array of the present invention is produced by selecting Tag SNPs using the selection method or computer system of the present invention described above (first step) and mounting nucleic acid probes corresponding to the selected Tag SNP information (second step); that is, by (a) a first step of selecting TagSNPs according to the selection method of the present invention, and (b) a second step of mounting on a DNA microarray, based on the TagSNPs selected in the first step, nucleic acid probes for detecting the genotypes of those TagSNPs in the human genome in a sample.
  • For the second step, known methods can be widely used, and new means for producing DNA microarrays that become available in the future can also be used as long as the effects of the present invention are not impaired.
  • Nucleic acid probes can be prepared by, for example, gene amplification methods such as PCR and RNA PCR using appropriate amplification primers for the base sequence of the human genome containing the target SNP base, or by chemical synthesis of DNA. In this way, DNA fragments serving as the probe group can be obtained.
  • The base length of the DNA fragment is not particularly limited, but is preferably 10 to 100 bases, more preferably 10 to 40 bases. A longer DNA fragment increases the probe's ability to capture the target nucleic acid containing the SNP base but tends to be unsuitable for high-density DNA microarrays. A shorter fragment tends to have poorer ability to capture the target nucleic acid.
  • The base length of the nucleic acid probes mounted on the DNA microarray can be designed, and the probes manufactured, with these tendencies in mind.
  • The DNA fragment may be modified, and known modification methods can be used. Modifying agents commonly used in this field, such as various fluorescent dyes and coloring dyes, can be used as appropriate, without limitation.
  • In this way, a nucleic acid probe is prepared that targets a Tag SNP selected according to the present invention and can generate a capture signal on the DNA microarray upon contact with a DNA specimen derived from a sample.
  • A DNA microarray carrying the desired nucleic acid probes can be produced by attaching and immobilizing nucleic acid probes prepared in advance on a carrier.
  • Examples of the carrier include solid phase carriers made of materials such as glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, gel, and other porous or non-porous materials.
  • Examples of methods for attaching the nucleic acid probes to the carrier surface include printing onto a plate. Techniques for producing high-density arrays include generating in situ, by photolithographic synthesis, an array containing thousands of oligonucleotides complementary to defined sequences at defined positions on the surface, and rapidly synthesizing predesigned DNA strands and attaching them directly to the carrier; a DNA microarray can also be produced using masking techniques. Arrays can also be produced with an inkjet printing apparatus for oligonucleotide synthesis, and DNA microarrays using fluorescent beads or magnetic beads can be produced as well.
  • When the array of the present invention produced as described above is brought into contact with a DNA specimen, the presence of a base substitution at each Tag SNP selected by the present invention is detected in that specimen as the signal of each spot, so that the SNP can be detected and it can be confirmed whether it is homozygous or heterozygous.
  • From this, it is possible by imputation to obtain information on the Target SNPs, that is, the SNPs other than the Tag SNPs, which are not mounted on the DNA microarray.
  • The DNA sample to be used is not particularly limited as long as human genomic DNA can be obtained from it, even in a small amount. Examples include blood, saliva, urine, feces, sweat, nails, hair, skin, oral tissue, semen, marrow fluid, and lymph.
  • A DNA specimen can be obtained by purifying the genomic DNA in these original samples.
  • Example 1: Selection of Tag SNPs. As described above, the computer program with the contents shown in FIG. 1 was executed on a file consisting of information on the chromosomal sites at which variants were found in the whole-genome data of 1,070 Japanese individuals determined using NGS (Next Generation Sequencer) at the Tohoku Medical Megabank Organization (ToMMo), in order to select the TagSNPs to be included in the nucleic acid probes mounted on a DNA microarray.
  • The threshold on the “r² linkage disequilibrium value” used for the prior narrowing-down was “r² > 0.8”, and the “neighborhood defined within a certain range”
  • was set to ±500 kbp from the locus when carrying out the selection method of the present invention.
  • The number of TagSNPs used for the nucleic acid probes to be mounted on the DNA microarray was 675,000.
  • The Tag SNP candidates and Target SNPs used here were selected from a group of about 9.4 million SNPs that had been analyzed beforehand with Affymetrix DNA microarrays, but such prior selection is not essential.
  • The selection method of the present invention can also be performed by arbitrarily setting a Tag SNP candidate group and a Target SNP group from any SNP group. Excluding SNPs with a low MAF from the Tag SNP candidates in advance is also an efficient approach.
  • The selection method of the present invention can also be performed based on an existing list of TagSNPs.
  • The performance of the 675,000 Tag SNPs selected as described above was evaluated
  • by imputing the SNP genotypes of 131 individuals different from the 1,070 above.
  • NGS was used to identify the SNP loci and genotypes of the 131 individuals, and from these the genotype information corresponding to the 675,000 Tag SNPs selected in this example was extracted.
  • Specifying the genotypes at the TagSNPs from the NGS results in this way corresponds to specifying them using a DNA microarray.
  • From the genotypes of the 131 individuals at these Tag SNPs, their SNP genotypes were estimated by referring to the aforementioned human genome information of the 1,070 individuals (imputation).
  • The square (r²) of the correlation coefficient between the genotypes of the 131 individuals estimated by imputation and the genotypes determined by NGS was calculated. If the estimates agree completely with the experimentally determined (NGS, etc.) results for all 131 individuals, r² is 1.0, meaning that the true genotypes are estimated perfectly; conversely, the smaller the value of r², the more the estimates differ from the true genotypes.
  • To evaluate the Tag SNP selection, the r² values calculated in this way were averaged within each MAF class of the SNPs being estimated.
  • As a result, the mean r² was 0.81 for SNPs with MAF 1 to 3%, 0.88 for SNPs with MAF 3 to 5%, and 0.96 for SNPs with MAF of 5% or more, indicating very good imputation performance.
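The evaluation described above (per-SNP r² between imputed and NGS-determined genotypes, averaged within MAF classes) can be sketched as follows. This is an illustration under assumptions: the MAF of each SNP is supplied by the caller, class boundaries are treated as half-open intervals, SNPs below 1% are skipped, and all names are hypothetical.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov * cov / (vx * vy)


def maf_class(maf):
    """Assign a MAF to one of the three classes used in the example."""
    if maf < 0.01:
        return None          # below 1%: not reported in the example
    if maf < 0.03:
        return "1-3%"
    if maf < 0.05:
        return "3-5%"
    return ">=5%"


def mean_r2_by_maf(mafs, true_genotypes, imputed_genotypes):
    """Average per-SNP r^2 within each MAF class."""
    totals = {}
    for maf, truth, imputed in zip(mafs, true_genotypes, imputed_genotypes):
        label = maf_class(maf)
        if label is None:
            continue
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + squared_correlation(truth, imputed), count + 1)
    return {label: total / count for label, (total, count) in totals.items()}


# Toy input: one SNP per rare class, two common SNPs (one imputed badly).
result = mean_r2_by_maf(
    mafs=[0.02, 0.04, 0.30, 0.30],
    true_genotypes=[[0, 0, 0, 1], [0, 0, 1, 1], [0, 1, 2, 1], [0, 0, 1, 1]],
    imputed_genotypes=[[0, 0, 0, 1], [0, 0, 1, 1], [0, 1, 2, 1], [0, 1, 0, 1]],
)
print(result)
```

Imputed values may also be fractional dosages rather than hard genotype calls; the correlation is computed the same way.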
  • Example 4 Examples 4-1 and 4-2 of Japanese Patent Application No. 2014-223834.
  • Example 2: Comparison with an existing commercial DNA microarray (1). For comparison with the above example, the SNP genotypes of the same 131 individuals were estimated by imputation using the SNPs mounted on an existing commercial DNA microarray. With imputation using the SNP information of Illumina's Human Omni 2.5-8 (hereinafter simply referred to as OMNI2.5), the mean r² was 0.80 for SNPs with MAF 1 to 3%, 0.87 for SNPs with MAF 3 to 5%, and 0.96 for SNPs with MAF of 5% or more. This imputation performance is almost equivalent to that of the above example, but the commercial DNA microarray carries about 2.3 million SNPs (exactly 2,338,671).
  • This showed a great advantage: with the Tag SNPs selected by the present invention, SNP genotypes can be estimated far more efficiently than when the SNPs mounted on the existing commercial DNA microarray are used.
  • Example 3: Comparison with an existing commercial DNA microarray (2). Next, to verify how much imputation performance is retained with fewer mounted probes than the 675,000 above, imputation performance was evaluated for TagSNP counts of 300,000, 400,000, 500,000, and 600,000 in addition to 675,000, for MAF 1 to 3%, 3 to 5%, and 5% or more. The results are shown in Table 1. The TagSNPs used here are specifically disclosed in Japanese Patent Application No. 2014-223834: Example 4-1 (300,000), Example 4-2-1 (400,000), Example 4-2-2 (500,000), Example 4-2-3 (600,000), and Example 4-2-4 (675,000).
  • It was revealed that a DNA microarray with almost the same performance as OMNI2.5 can be designed even when the number of mounted probes is reduced to about 1/10 of the roughly 2.3 million probes mounted on OMNI2.5.
  • S1: step describing the first function of the program of the present invention
  • S1′: step describing the initial state of the Tag SNPs and Target SNPs selected after S1
  • S2: step describing the first half of the second function
  • S2-1 (1): step describing the opening end of the first loop in S2
  • S2-2: step describing initialization for the TagSNP candidate
  • S2-3 (1): step describing the opening end of the second loop in S2
  • S2-4: step describing the determination of whether or not to calculate the score
  • S2-5: step describing addition of the score to the TagSNP candidate for which the score has been calculated
  • S2-3 (2): step describing the closing end of the loop of S2-3
  • S2-1 (2): step describing the closing end of the loop of S2-1 (1)
  • S3: step describing the selection of the one TagSNP candidate with the maximum score calculated in S2
  • S3-1: step describing the index of the TagSNP candidate with the maximum score
  • S3-2 (1): step describing the opening end of the loop in S3
  • S3-3: step describing the determination of whether or not to perform the update of the next step
  • S3-4: step describing the update
  • S3-2 (2): step describing the closing end of the loop of S3-2
  • S4: step describing the third function of the program of the present invention

Abstract

The present invention addresses the problem of discovering a means for achieving the more proper selection of a Tag SNP to be contained in a nucleic acid probe, which is a probe contained in a DNA microarray or the like and used as a means for carrying out imputation, in the imputation of SNPs. Specifically, it is found that the problem can be solved by a method for selecting a Tag SNP that is used as a means for imputing information on SNPs of human genome using human genome information that includes information on a group of SNPs, in which genotypes of multiple individuals are specified, for the purpose of constituting a group of nucleic acid probes corresponding to the Tag SNP, wherein the sum total of mutual information amounts between Tag SNP candidates and Target SNPs for the candidates is employed as a measure for the selection of the Tag SNP. Thus, provided are: a computer system and a computer program which are developed on the basis of the above-mentioned principle; a DNA microarray which is equipped with a group of nucleic acid probes corresponding to a Tag SNP selected by the aforementioned means; and a method for producing the DNA microarray.

Description

Tag SNP selection method, computer system for selection, software for selection, and DNA microarray equipped with nucleic acid probes corresponding to Tag SNPs selected using the selection method
The present invention relates to the field of nucleic acid-based genetic analysis. More specifically, based on information on single nucleotide polymorphisms (SNPs) of the human genome, it provides means for deriving the overall SNP information of an individual human genome from less SNP information and with better accuracy.
Just as our faces, body shapes, and even personalities vary endlessly, the base sequence that constitutes the genetic code is known to differ considerably from person to person. These differences in the genetic code are generally called polymorphisms. Several types of polymorphism are known, and among them SNPs are currently attracting particular attention in relation to so-called custom-made medicine.
Medicine to date, meanwhile, has concentrated on investigating the causes of diseases and developing treatments. In reality, however, it is also known that the same treatment does not produce the same therapeutic effect in every individual.
Custom-made (tailor-made) medicine means medical care in which, rather than treating therapeutic means uniformly, a treatment suited to each patient's constitution is applied on a made-to-order basis. The essential element in knowing each patient's constitution is that individual's genetic information, and the decoding of the human genome is now revealing the relationships between various genetic information and constitutions and diseases. Against this background, SNPs are the human genetic element currently attracting the most attention.
SNP is an abbreviation of single nucleotide polymorphism and means a single-base difference between individuals. SNPs are the most numerous of the genetic polymorphisms, and the human genome is estimated to contain more than 30 million of them. SNPs are recognized as one of the most important factors when considering individual differences between humans, and their relationships with diseases and constitution, with drug efficacy, and so on are now being analyzed, with many results already achieved.
Suppose that personal genetic analysis centered on SNPs were performed and an individual's genetic tendencies could thereby be identified, for example susceptibility to diseases said to be strongly related to lifestyle, such as hypertension, diabetes, cancer, heart disease, and stroke. Preventive measures could then be taken in advance through proactive lifestyle guidance on diet, exercise, and the like. This is expected not only to help make life more fruitful but also to curb the growth of medical expenses. Even when a person becomes ill, if the efficacy of a drug and the risk of side effects are known in advance from SNP analysis, useless or dangerous medication can be avoided.
At the same time, it is becoming clear that not just one SNP but multiple SNPs are involved, in various combinations, in such individual constitutions, and it is increasingly evident that SNP analysis should preferably be comprehensive.
Under these circumstances, attempts are now being made to apply DNA microarrays, which are used as a means of comprehensive gene analysis, to SNP analysis of the human genome.
When SNPs are analyzed using a DNA microarray, the first issue is the number of SNP nucleic acid probes to be mounted on the DNA microarray. A nucleic acid probe for an SNP (hereinafter also simply “nucleic acid probe”) is essentially a base sequence fragment of the human genome containing the SNP base, or its complementary strand. More than 30 million SNPs are currently known, and at present it is difficult, both technically and in terms of cost, to mount nucleic acid probes for all of them on a DNA microarray for widespread use in SNP detection.
Attempts have therefore been made to narrow down the nucleic acid probes to be mounted on a DNA microarray by limiting them to those related to human constitution, disease, and the like, and by performing a process called imputation.
This approach exploits the fact that SNPs in the genome are correlated with one another. Highly correlated SNPs are concentrated in limited regions (haplotype blocks). Imputation is a technique for narrowing down the SNPs to be mounted on a DNA microarray, based on the idea that if appropriate SNPs (TagSNPs) are chosen from a haplotype block, the genotypes of the SNPs strongly correlated with them (TargetSNPs) can be estimated with high probability without experimental typing.
Prior art document 1 mentioned above discloses an attempt to select, from TagSNP candidates, TagSNPs with suitably high-probability linkage by using their relationship with TargetSNPs.
At present, however, DNA microarrays for SNP detection with high estimation accuracy carry more than one million nucleic acid probes and are therefore costly. With fewer probes, on the other hand, estimation accuracy falls, which is a problem for providing accurate SNP-based prediction of diseases and the like.
An object of the present invention is to find means for more appropriately selecting the TagSNPs contained in the nucleic acid probes that are used as means for performing imputation, for example in a DNA microarray for SNP detection.
The present inventors examined the use of “mutual information”, which is used for predicting RNA secondary structure, registering images in medical image processing, and so on, as an index for selecting appropriate TagSNPs. Surprisingly, they found that the number of nucleic acid probes corresponding to TagSNPs used in a DNA microarray or the like for SNP detection can be greatly reduced while imputation based on the results obtained with that DNA microarray maintains accuracy equal to or higher than that of existing commercial DNA microarrays, and thus completed the present invention. As noted above, “SNP” in the present invention is an abbreviation of single nucleotide polymorphism and, like “nucleic acid probe”, denotes both the singular and the plural. “Group” in “SNP group” or “nucleic acid probe group” conceptually means the presence of a large number of SNPs or nucleic acid probes, but strictly it means a plurality, that is, two or more. A “nucleic acid probe corresponding to a TagSNP” is a nucleic acid probe for identifying that SNP, and is specifically disclosed in the section “Array of the present invention” under item (3) of the description of embodiments.
The present invention provides the following inventions.
First, the present invention provides a method for selecting TagSNPs (hereinafter also referred to as the selection method of the present invention) in order to constitute a group of nucleic acid probes corresponding to TagSNPs used as means for imputing SNP information of the human genome, using human genome information that includes information on a group of SNPs for which the genotypes of multiple individuals have been specified, the method being characterized by:
a) taking the SNP group in the human genome information as the population, treating the SNPs present in a neighborhood defined within a certain range of the locus of each SNP that is a TagSNP candidate as TargetSNPs, and calculating the sum of the mutual information between that TagSNP candidate and these TargetSNPs; and
b) selecting, from among all TagSNP candidates and in descending order of that sum, the TagSNP candidates with large sums of mutual information as the TagSNPs to be present in the nucleic acid probes used as the means for imputation.
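Steps a) and b) can be summarized in symbols as follows. The notation is an assumption introduced here for illustration: $G_t$ is the genotype at TagSNP candidate $t$, $N(t)$ is the set of TargetSNPs in the neighborhood of its locus, and $p$ denotes genotype frequencies in the human genome information. The mutual information is written in its standard form.

```latex
\[
  I(G_t; G_s) = \sum_{a}\sum_{b} p_{ts}(a,b)\,
                \log\frac{p_{ts}(a,b)}{p_t(a)\,p_s(b)},
  \qquad
  S(t) = \sum_{s \in N(t)} I(G_t; G_s)
\]
```

Candidates are then taken in descending order of $S(t)$.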
Second, the present invention provides a DNA microarray (also referred to as the array of the present invention) characterized by carrying nucleic acid probes corresponding to TagSNPs selected according to the selection method of the present invention. The array of the present invention can be produced by a method for producing a DNA microarray (hereinafter also referred to as the array production method of the present invention) characterized by including the following steps (1) and (2):
(1) a first step of selecting TagSNPs according to the selection method of the present invention;
(2) a second step of mounting on a DNA microarray, based on the TagSNPs selected in the first step, nucleic acid probes for detecting the genotypes of those TagSNPs in the human genome in a sample.
Third, the present invention provides the following computer system (hereinafter also referred to as the computer system of the present invention). The computer system uses human genome information containing information on a group of SNPs whose genotypes have been determined for a plurality of individuals, and selects the TagSNPs that make up a group of nucleic acid probes used as a means for imputing SNP information of the human genome. It comprises a recording unit and an arithmetic processing unit, and is characterized in that:
(A) the recording unit records at least the following items, where the TagSNP candidate information is read from the human genome information and the information on SNPs located in a neighborhood defined as a fixed range from the loci of those TagSNP candidates is treated as TargetSNP information:
(1) the locus of each TagSNP candidate on the human genome,
(2) the genotype of each TagSNP candidate in each individual's human genome information,
(3) the locus of each TargetSNP on the human genome, and
(4) the genotype of each TargetSNP in each individual's human genome information;
(B) the arithmetic processing unit calculates, based on items (1) to (4) of (A) read from the recording unit, the sum of the mutual information between each TagSNP candidate and its corresponding TargetSNPs, and selects the candidate with the largest sum as the first TagSNP;
(C) based on the TagSNP information and TargetSNP information from which the information on the TagSNPs selected so far and their corresponding TargetSNP groups has been removed, step (B) is performed again and the candidate with the largest sum of mutual information is selected as the second TagSNP; and
(D) steps (B) and (C) are repeated to select the M-th TagSNP (M being a natural number), the repetition being carried out the remaining M-2 times until M reaches the planned number of nucleic acid probes, corresponding to the selected TagSNPs, that are to be used as the means for imputation.
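The iteration in steps (B) to (D) amounts to a greedy loop, sketched here in Python under stated assumptions. The input `mi` is a hypothetical precomputed table mapping each candidate to the mutual information with each of its neighboring TargetSNPs. The sketch further assumes that "removing" a selected TagSNP and its TargetSNP group means those SNPs no longer contribute to any later sum, while unselected SNPs remain eligible as candidates; the specification does not settle that detail.

```python
def greedy_select(mi, m):
    """Pick up to m TagSNPs, each time taking the largest remaining MI sum.

    mi: dict mapping candidate id -> {target id: mutual information},
        restricted to targets inside the candidate's neighborhood
        (hypothetical layout, assumed precomputed).
    """
    candidates = set(mi)
    covered = set()   # selected TagSNPs plus their TargetSNP groups
    selected = []

    def score(s):
        # Sum of MI over targets that have not been removed yet.
        return sum(v for t, v in mi[s].items() if t not in covered)

    while len(selected) < m and candidates:
        # sorted() makes tie-breaking deterministic (lexicographic id order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.discard(best)
        covered.add(best)
        covered.update(mi[best])  # remove the corresponding TargetSNP group
    return selected
```

Recomputing every score in each round keeps the sketch short; at the scale of hundreds of thousands of selections, only the scores of candidates near the last pick would need updating.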
This "computer system" belongs to the category of a "product" (a thing) and may also be referred to as a "device".
Fourth, the present invention provides the following computer program (hereinafter also referred to as the program of the present invention). The program uses human genome information containing information on a group of SNPs whose genotypes have been determined for a plurality of individuals, and selects the TagSNPs that make up a group of nucleic acid probes used as a means for imputing SNP information of the human genome. It is characterized by including an algorithm that causes a computer to realize:
(A) a first function of reading, for processing in an arithmetic processing unit, items (1) to (4) below from a recording unit in which they are recorded, where the TagSNP candidate information is read from the human genome information and the information on SNPs located in a neighborhood defined as a fixed range from the loci of those TagSNP candidates is treated as TargetSNP information:
(1) the locus of each TagSNP candidate on the human genome,
(2) the genotype of each TagSNP candidate in each individual's human genome information,
(3) the locus of each TargetSNP on the human genome, and
(4) the genotype of each TargetSNP in each individual's human genome information;
(B) a second function of calculating, based on items (1) to (4) read by the first function, the sum of the mutual information between each TagSNP candidate and its corresponding TargetSNPs, and selecting the candidate with the largest sum as the first TagSNP; and
(C) a third function of selecting, by the second function again and based on the TagSNP information and TargetSNP information from which the information on the TagSNPs selected so far and their corresponding TargetSNP groups has been removed, the candidate with the largest sum of mutual information as the second TagSNP, and thereafter repeating steps (B) and (C) the remaining M-2 times to select the M-th TagSNP (M being a natural number), until M reaches the planned number of nucleic acid probes, corresponding to the selected TagSNPs, that are to be used as the means for imputation.
The present invention further provides a computer-readable recording medium (hereinafter also referred to as the recording medium of the present invention) on which the program of the present invention is recorded. A typical computer system of the present invention is one that executes the program of the present invention.
(I) In the selection method and computer system of the present invention, it is preferable from the viewpoint of selection efficiency that the TargetSNP group used to calculate the sum of mutual information for each TagSNP candidate consist of TargetSNPs that have been narrowed down in advance by an index other than mutual information. From the same viewpoint, the program of the present invention preferably includes, at a stage preceding the algorithm that realizes the second function, an algorithm that selects the TargetSNP group by an index other than mutual information and thereby preliminarily narrows down the TargetSNP group to which the second function is applied.
Here, the "index other than mutual information" is typically a linkage disequilibrium value between a TagSNP candidate and a TargetSNP located in the neighborhood defined as a fixed range from the locus of that TagSNP, for example the r² linkage disequilibrium value or the d linkage disequilibrium value. When selecting TagSNPs, it is desirable to exclude SNPs whose linkage disequilibrium value is below a specific threshold and to use the remaining SNPs as TargetSNPs in the mutual information calculation for TagSNP selection. Among these indices, the r² linkage disequilibrium value is preferred, and its threshold is preferably in the range of 0.70 to 0.85. If the threshold exceeds 0.85, the preliminary narrowing becomes too strict and the risk of excluding inherently suitable TagSNP candidates from selection increases. If it is below 0.70, too many SNPs enter the mutual information sum, the preliminary narrowing becomes too loose, and the selection step tends to become inefficient.
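The r² pre-filter can be sketched in Python as follows. As a simplifying assumption, r² is approximated here by the squared Pearson correlation of 0/1/2 genotype counts rather than computed from phased haplotype frequencies. The function names and the default threshold of 0.80 (a value inside the 0.70 to 0.85 range given above) are illustrative.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)

def filter_targets(candidate, neighbours, genotypes, threshold=0.80):
    """Keep only neighbours whose r^2 with the candidate reaches the threshold."""
    ref = genotypes[candidate]
    return [t for t in neighbours if r_squared(ref, genotypes[t]) >= threshold]
```

Only the TargetSNPs that pass this filter would then enter the mutual information sum for the candidate.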
(II) In the present invention (selection method, computer system, and program), the "neighborhood defined as a fixed range" from the locus of a TagSNP candidate is preferably within 500 kbp both upstream and downstream of that locus, and more preferably 100 to 500 kbp.
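A window lookup of this kind can be sketched with a binary search over sorted positions. The sketch assumes base-pair coordinates on a single chromosome held in a sorted list; the function name and the 500 kbp default are illustrative.

```python
from bisect import bisect_left, bisect_right

def neighbours_in_window(sorted_positions, center, window_bp=500_000):
    """Positions within window_bp upstream or downstream of `center`, excluding it."""
    lo = bisect_left(sorted_positions, center - window_bp)
    hi = bisect_right(sorted_positions, center + window_bp)
    return [p for p in sorted_positions[lo:hi] if p != center]
```

Narrowing `window_bp` toward 100 kbp reduces the number of TargetSNPs per candidate, and with it the cost of each mutual information sum.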
(III) In the present invention (selection method, computer system, and program), the "number of TagSNPs selected" for the nucleic acid probes used as the means for imputation must be at least the number at which imputation using that means satisfies a predetermined performance. The index that defines this "predetermined performance" is not particularly limited, but it is preferably one that reflects as objectively as possible the imputation performance of the means that uses the TagSNP information.
One example of such an index is a number of TagSNPs large enough that, for SNPs with a minor allele frequency (MAF) of 5% or more, the mean squared correlation coefficient between the genotypes typed experimentally and the genotypes estimated by imputation is 0.94 or more, preferably 0.95 or more, and more preferably 0.96 or more. With fewer TagSNPs than this, the correlation between the results imputed from the typing of the selected TagSNPs and the actual genotypes cannot be said to be superior to that of conventional products, and it becomes difficult to fully realize the advantage over conventional products expected of the present invention. Further indices can also be used: a mean squared correlation coefficient between imputed and actual genotypes of 0.82 or more, preferably 0.84 or more, and more preferably 0.87 or more, for SNPs with an MAF of 3 to 5%; or of 0.73 or more, preferably 0.75 or more, and more preferably 0.79 or more, for SNPs with an MAF of 1 to 3%.
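The MAF-stratified index can be sketched in Python as follows. The sketch assumes true genotypes are 0/1/2 allele counts, imputed values are genotypes or dosages on the same scale, and the bins are half-open intervals (1% ≤ MAF < 3%, 3% ≤ MAF < 5%, MAF ≥ 5%); the bin boundaries' handling and all names are assumptions made for illustration.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation; 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 allele counts of diploid individuals."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)

# (label, lower bound inclusive, upper bound exclusive)
MAF_BINS = (("1-3%", 0.01, 0.03), ("3-5%", 0.03, 0.05), (">=5%", 0.05, 0.51))

def mean_r2_by_maf_bin(typed, imputed):
    """Mean r^2 between typed and imputed genotypes, per MAF bin (None if empty)."""
    per_bin = {label: [] for label, _, _ in MAF_BINS}
    for snp, truth in typed.items():
        maf = minor_allele_frequency(truth)
        for label, lo, hi in MAF_BINS:
            if lo <= maf < hi:
                per_bin[label].append(squared_correlation(truth, imputed[snp]))
                break
    return {label: (sum(v) / len(v) if v else None) for label, v in per_bin.items()}
```

The three resulting means can then be compared against the thresholds quoted above (for example 0.94 for MAF of 5% or more) to decide whether a given number of TagSNPs is sufficient.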
The upper limit of this number is not particularly limited, but as of the completion of the present invention it is 1,000,000 or less, and 700,000 or less is preferable in view of both the economy of the number used and the reliability of the predictions for SNPs. As a guide, the lower limit is about 300,000. As shown in the examples described later, even 300,000 TagSNPs allow excellent imputation that exceeds the basic MAF-based levels described above. About 400,000 or more is preferred, about 500,000 or more is more preferred, and about 600,000 or more is highly preferred; an appropriate number can be chosen by referring to the MAF-based indices above according to the intended performance of the array of the present invention. In Japanese Patent Application No. 2014-223834, up to 675,000 TagSNPs were actually identified in Japanese individuals and disclosed in that specification.
The word "about" used with SNP counts such as "about 300,000" and "about 400,000" above has its ordinary meaning, but in particular it indicates that the imputation performance of a specific number of TagSNPs, for example 300,000, does not change substantially within a certain range around that number. Specifically, a difference within 1%, or strictly within 0.5%, of the specified number of TagSNPs makes no substantial difference in imputation performance. This value serves as a guide when some SNPs are removed from a TagSNP group once it has been selected. Moreover, if the SNPs removed from the selected TagSNPs do not in fact contribute to imputation, their removal affects imputation performance even less.
Among the TagSNPs selected according to the selection method of the present invention, a small number are expected, once converted into the corresponding nucleic acid probes and mounted on a DNA microarray, not to be detected as SNPs in the population to which the invention is applied and therefore not to show adequate imputation performance. This becomes apparent mainly through after-the-fact verification, and such non-functioning SNPs can be removed from the TagSNP group that is used. Because the number of SNPs to be removed is relatively very small (about 0.1% at most), such removal falls well within the range described above in which imputation performance does not substantially change. In other words, when a specific number of TagSNPs is selected according to the selection method of the present invention, that number can be taken to allow for the removal of SNPs up to the percentages described above.
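The percentages above translate into concrete counts: for 300,000 selected TagSNPs, 1% is 3,000 SNPs, 0.5% is 1,500, and 0.1% is 300. A small Python helper makes the check explicit; the function names are hypothetical, and percentages are passed as strings so that exact rational arithmetic avoids floating-point rounding.

```python
from fractions import Fraction

def removal_allowance(n_selected, percent):
    """Largest SNP count within `percent` % of n_selected (percent as a string, e.g. "0.5")."""
    return int(n_selected * Fraction(percent) / 100)

def within_allowance(n_selected, n_removed, percent="1"):
    """True if removing n_removed SNPs stays within the stated tolerance."""
    return n_removed <= removal_allowance(n_selected, percent)
```

For instance, removing 300 non-functioning probes from a 300,000-SNP set sits exactly at the 0.1% level and well inside the 0.5% and 1% tolerances.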
(IV) The "human genome information" used in carrying out the selection method and computer system of the present invention may be drawn from a human genome database, for example the database of the international 1000 Genomes Project, which covers human populations worldwide; however, using human genome information from a narrower category tends to improve the accuracy of SNP estimation based on TagSNPs. Suitable categories include the racial level, namely Asian Mongoloid, and more finely Japanese, Chinese, Malay, Polynesian, Micronesian, and so on; Caucasoid, and more finely Italian, English, Iranian, Indian, Lapp, and so on; Amerind, and more finely Eskimo, Brazilian Indian, Alaskan Indian, and so on; Negroid, and more finely Nigerian, Bantu, Bushua, and so on; and Australoid, and more finely Aboriginal Australian, Papua New Guinean, and so on. Still narrower categories are also possible, and narrowing further to a specific region or to a group of individuals affected by a disease allows, for example, endemic diseases to be analyzed and predicted accurately. In every case, however, the existence of concrete human genome information is a prerequisite. In the examples herein, verification was carried out using the human genome database of 1,070 Japanese individuals from the Tohoku University Tohoku Medical Megabank Organization (ToMMo), and the usefulness of the present invention was confirmed.
(V) The genotypes detected by the group of nucleic acid probes corresponding to the TagSNPs selected in the present invention (selection method, computer system, and program) are, as described above, suitable for use in imputing SNP information of the human genome. The means for detecting these genotypes is not particularly limited as long as it can detect SNP genotypes, and includes nucleic acid detection means capable of detecting SNPs that are available now or will become available in the future. Specific examples include DNA microarrays, next-generation sequencers (NGS), Sanger sequencers, and MassARRAY (registered trademark). Among these, one of the most suitable means at present is SNP detection with the DNA microarray provided as the array of the present invention.
(VI) The array of the present invention, which uses nucleic acid probes capable of detecting the base polymorphism at each TagSNP, can be produced according to DNA microarray production methods known at the time of the present invention; production methods that become available in the future can also be applied.
(VII) Addition of other SNPs
In the present invention, separately from the selection of TagSNPs, one or more other SNPs can be selected and included with priority over the TagSNPs, or a means for doing so can be provided.
That is, in the selection method of the present invention, separately from the selection of TagSNPs by that method, one or more other SNPs can be selected and included with priority over the TagSNPs, and a group of nucleic acid probes corresponding to these other SNPs can be mounted on the array of the present invention.
Likewise, in the computer system of the present invention, separately from the selection of TagSNPs by the selection method of the present invention, one or more other SNPs can be selected and included with priority as SNPs to be selected.
In the program of the present invention, an algorithm can be provided that, separately from the selection of TagSNPs by the selection method of the present invention, selects one or more other SNPs and designates them with priority as SNPs to be selected. Hereinafter, unless otherwise noted, "other SNPs" means the "one or more other SNPs" described above.
When the other SNPs are included, overlap between them and the TagSNPs selected by the selection method of the present invention is preferably removed. The way of removing one of each duplicated pair is not particularly limited. For example, the SNPs to be included with priority can be removed in advance from the SNP population used for TagSNP selection (or a means for doing so can be provided); alternatively, SNPs that overlap already selected TagSNPs can be removed afterwards from the other SNPs to be included (or a means for doing so can be provided).
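The two de-duplication routes can be sketched in Python as follows. SNPs are represented by hypothetical string identifiers, and the function names are illustrative rather than taken from the specification.

```python
def exclude_before_selection(population, other_snps):
    """Route 1: drop the priority SNPs from the population before TagSNP selection."""
    other = set(other_snps)
    return [s for s in population if s not in other]

def drop_overlap_after_selection(selected_tags, other_snps):
    """Route 2: after selection, drop from the other SNPs any that are already TagSNPs."""
    tags = set(selected_tags)
    return [s for s in other_snps if s not in tags]

def combined_probe_set(selected_tags, other_snps):
    """Final SNP list for probe design: other SNPs first, no duplicates."""
    extra = drop_overlap_after_selection(selected_tags, other_snps)
    return extra + list(selected_tags)
```

Either route yields a probe list in which each SNP appears once; the first changes which TagSNPs are selected, while the second leaves the selection untouched.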
Preferred other SNPs are those that are unlikely to be chosen by the selection method of the present invention but are useful in practice. Giving priority to nucleic acid probes that identify these SNPs serves purposes such as giving the DNA array more distinctive characteristics.
However, the other SNPs are not included for the purpose of imputation based on them; they are included so that their detection can itself serve directly as an indicator of a specific disease or genetic predisposition. Accordingly, when the imputation performance of the TagSNP group selected by the selection method of the present invention is evaluated, the included other SNPs are excluded. Even if some of the other SNPs overlap with TagSNPs, their number is relatively small and can be practically ignored when evaluating imputation performance. In Example 4-3 of Japanese Patent Application No. 2014-223834, imputation performance was deliberately evaluated with the included other SNPs counted. This was done, however, to confirm that including a considerable number (more than 20,000) of other SNPs, that is, SNPs that are for the most part not TagSNPs, among about 650,000 SNPs has only a minor effect on imputation performance. Specifically, 21,059 TagSNPs were removed from the group of 675,000 TagSNPs and the same number (21,059) of other SNPs was added in their place. The imputation performance calculated with these other SNPs deliberately included was a mean r² of 0.804 for SNPs with an MAF of 1 to 3%, 0.884 for SNPs with an MAF of 3 to 5%, and 0.959 for SNPs with an MAF of 5% or more, exceeding that of an existing commercial DNA array (OMNI2.5).
Practically useful SNPs that are candidates for use as "other SNPs" include (a) SNPs that are in weak linkage disequilibrium with the TagSNPs and whose genotypes are therefore difficult to estimate with sufficient accuracy by imputation, (b) SNPs on the Y chromosome and in mitochondria, (c) SNPs reported in previous studies to be associated with disease, (d) SNPs in the HLA region, and (e) SNPs reported to be associated with drug metabolism. These are described in more detail below.
(a) SNPs in weak linkage disequilibrium with the TagSNPs, whose genotypes are difficult to estimate with sufficient accuracy by imputation:
Other SNPs in this class are, among the TagSNPs, those whose r² linkage disequilibrium value with the TagSNPs of the present invention is low (for example, r² < 0.2). Among these, it is preferable in practice to select SNPs that affect the amino acid sequence of a protein.
(b) SNPs on the Y chromosome and in mitochondria:
For other SNPs in this class, genetic recombination does not occur in the Y chromosome region, so selecting TagSNPs by the r² linkage disequilibrium value is not effective. Because these SNPs are few in number, it is relatively easy to select all of them from the target SNPs regardless of the r² linkage disequilibrium value.
(c) SNPs reported in previous studies to be associated with disease:
Other SNPs in this class are recorded in the NHGRI GWAS Catalog database (http://www.genome.gov/gwastudies/; Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001-6 (2014)).
(d) SNPs in the HLA region:
For other SNPs in this class, the HLA region is one for which many disease associations have been reported, and it is suitable in practice to select SNPs in it from among the TagSNPs regardless of the r² linkage disequilibrium value.
(e) SNPs reported to be associated with drug metabolism:
For other SNPs in this class, the following publications report results obtained with the Affymetrix(R) DMET(TM) Plus (Affymetrix, Inc.), and the SNPs described in them can be used as other SNPs.
[Technology reviews]
・Burmester J. K., et al. DMET microarray technology for pharmacogenomics-based personalized medicine. Methods in Molecular Biology 632:99-124 (2010).
・Sissung T. M., et al. Clinical pharmacology and pharmacogenetics in a genomics era: the DMET platform. Pharmacogenomics 11(1):89-103 (2010).
・Deeken J. F. The Affymetrix DMET platform and pharmacogenetics in drug development. Current Opinion in Molecular Therapeutics 11(3):260-268 (2009).
[Identification of new drug-related biomarkers]
・Caldwell M. D., et al. CYP4F2 genetic variant alters required warfarin dose. Blood 111(8):4106-12 (2008).
・McDonald M. G., et al. CYP4F2 Is a vitamin K1 hydroxylase: A molecular explanation for altered warfarin dose in carriers of the functionally defective V433M variant. 15th North American Regional ISSX meeting Abstract 67 (2008).
[Drug development and safety research]
・Mega J. L., et al. Cytochrome p-450 polymorphisms and response to clopidogrel. New England Journal of Medicine 360(4):354-62 (2009).
・U.S. Food and Drug Administration. Early communication about an ongoing safety review of clopidogrel bisulfate (marketed as Plavix).
・Dumaual C., et al. Comprehensive assessment of metabolic enzyme and transporter genes using the Affymetrix Targeted Genotyping System. Pharmacogenomics 8(3):293-305 (2007).
・Daly T. M., et al. Multiplex assay for comprehensive genotyping of genes involved in drug metabolism, excretion, and transport. Clinical Chemistry 53(7):1222-30 (2007).
[Genotype/phenotype databasing]
・Man M., et al. Genetic variation in metabolizing enzyme and transporter genes: Comprehensive assessment in 3 major East Asian subpopulations with comparison to Caucasians and Africans. Journal of Clinical Pharmacology doi: 10.1177/0091270009355161 (2010).
・UNC's McCleod discusses 'practical' approach to bringing pharmacogenetics to all countries. GenomeWeb Pharmacogenomics Reporter (2010).
According to the present invention, there are provided a means that greatly reduces the number of Tag SNPs used in a means for imputation, such as a DNA microarray for SNP detection, while keeping the imputation performance based on the results obtained by that means at an accuracy equal to or higher than that of existing commercial DNA microarrays and the like, as well as a DNA microarray produced by this means and a method for producing it. More specifically, based on this large reduction in the number of Tag SNPs and the excellent imputation performance, the present invention makes it possible to select nucleic acid probes for SNP detection inexpensively, and thus to provide genetic information services at low cost. It is also possible to make the array detection unit needed for excellent imputation performance more compact by greatly reducing the number of nucleic acid probes, which is expected to contribute greatly to future improvements in genetic analysis technology. Furthermore, although the examples described below disclose results obtained with a Japanese population, the present invention can in principle be applied to a population based on any ethnic group, and can also be applied to imputation in a different ethnic group.
FIG. 1 is a flowchart outlining the content of the program of the present invention.
FIG. 2 is a flowchart that expresses FIG. 1 more concretely.
As described above, one object of the present invention is to select a group of Tag SNPs that greatly reduces the number of Tag SNPs corresponding to the nucleic acid probes mounted on an array used in a means for imputation, such as a DNA microarray for SNP detection, while keeping the imputation performance based on the results obtained by that means at an accuracy equal to or higher than that of existing commercial DNA microarrays and the like, and to prepare a DNA microarray carrying the nucleic acid probes corresponding to those Tag SNPs. This object can be achieved by the selection method of the present invention described above. The selection method is preferably carried out by executing the program of the present invention on the computer system of the present invention.
(1) Selection method of the present invention
In the "human genome information containing information on a group of SNPs whose genotypes have been determined for a plurality of individuals" used in the selection method of the present invention, the SNP group can be identified by applying known statistical processing to the base sequences of a plurality of human genomes obtained with a next-generation sequencer (NGS) or the like.
In addition, to obtain the "mutual information" that serves as the index of the selection method of the present invention, and linkage disequilibrium values such as the "r² linkage disequilibrium value", the genotype frequencies of the Tag SNPs and Target SNPs must have been calculated from the above "locus on the human genome and genotype of each SNP". These frequencies can be obtained by conventional methods. It is preferable that the haplotypes of the SNP group have been determined, because the linkage disequilibrium values and the mutual information of the SNP group can then be calculated more precisely. In that case, the genotype frequencies described above are replaced with the frequencies of the alleles constituting the genotypes, and the frequencies of genotype combinations between two SNPs are replaced with the frequencies of the determined haplotypes. "Phasing", the means for determining haplotypes, is known.
Phasing methods are broadly divided into the following two types.
(A) Methods that use linkage disequilibrium between segregating sites (polymorphic loci) (SHAPEIT2: Delaneau et al., Improved whole chromosome phasing for disease and population genetic studies, Nature Methods, 2013; MaCH: Li et al., MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes, Genetic Epidemiology, 2010)
These methods perform phasing statistically, usually using genotype data from a population of 1,000 or more individuals. Accuracy is high at loci carrying variants with a high allele frequency (5% or more), but tends to be low at loci with a low allele frequency because of insufficient data, so genotypes from a very large sample population are needed to obtain high accuracy.
(B) Methods that use sequencer read information (GATK Read Backed Phasing (developed by the Broad Institute); HapCompass: Aguiar D. and Istrail S., HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data, Journal of Computational Biology, 2012)
These methods perform phasing by examining the bases within sequencer reads that span heterozygous loci. Phasing is possible even at loci with a low allele frequency, but because sequencer reads are usually at most a few hundred bp long, the range over which phasing is possible tends to be limited. Read lengths are, however, increasing as next-generation sequencing technology advances.
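The idea behind read-based phasing in (B) can be illustrated with a toy sketch. It is not the algorithm of GATK or HapCompass; the function name `phase_pair_from_reads` and the representation of a read as a mapping from genomic position to observed base are assumptions made only for this illustration.

```python
from collections import Counter

def phase_pair_from_reads(reads, pos_a, pos_b):
    """Toy read-based phasing of two heterozygous sites.

    reads: list of dicts mapping genomic position -> observed base
           (hypothetical representation for this sketch).
    Returns the allele pairs seen on reads that span both sites,
    most frequently observed first (at most two pairs).
    """
    counts = Counter()
    for read in reads:
        # Only reads covering both sites tell us which alleles
        # lie on the same chromosome copy.
        if pos_a in read and pos_b in read:
            counts[(read[pos_a], read[pos_b])] += 1
    return [pair for pair, _ in counts.most_common(2)]

example_reads = [
    {100: "A", 150: "G"},
    {100: "A", 150: "G"},
    {100: "C", 150: "T"},
    {100: "C"},            # covers only the first site, so it is ignored
    {150: "T", 300: "A"},  # covers only the second site, so it is ignored
]
haplotypes = phase_pair_from_reads(example_reads, 100, 150)
print(haplotypes)
```

Because only reads that span both sites contribute, the sketch also shows why read length limits the range over which this kind of phasing works.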
In the selection method of the present invention,
a) with the SNP group in the human genome database as the population, the SNPs located within a predetermined neighborhood of the locus of each SNP that is a Tag SNP candidate are taken as Target SNPs, and the sum of the mutual information between that Tag SNP candidate and these Target SNPs is calculated.
Mutual information is the quantity defined by the following equation when two random variables x and y follow the probability distributions p(x) and p(y), and their joint probability follows p(x, y).
$$I(x; y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
In the present invention, x and y are the genotypes of two different SNPs, and p(x) and p(y) are their respective frequencies. p(x, y) is the frequency with which those genotypes are observed together at the two SNPs. The "mutual information between a Tag SNP candidate and a Target SNP" can be calculated according to this definition. In other words, calculating the mutual information requires not only the genotype frequencies of each Tag SNP candidate, but also the frequency with which each genotype of every Target SNP within the predetermined neighborhood of that candidate's locus is observed together with the candidate's genotype. When the haplotypes of the SNP group have been determined, however, the genotype frequencies may be replaced with the frequencies of the alleles constituting the genotypes, and the frequency with which genotypes are observed together at the two SNPs may be replaced with the haplotype frequency.
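As a minimal sketch of this calculation, the mutual information between two SNPs can be estimated from per-individual genotypes by counting. The function name `mutual_information`, the coding of genotypes as 0/1/2, and the use of a base-2 logarithm are assumptions of this example; the text does not specify the base of the logarithm.

```python
from collections import Counter
from math import log2

def mutual_information(tag_genotypes, target_genotypes):
    """Estimate I(x; y) in bits from paired per-individual genotypes.

    Both arguments list one genotype per individual, in the same order.
    Frequencies p(x), p(y) and p(x, y) are estimated by simple counting.
    """
    if len(tag_genotypes) != len(target_genotypes):
        raise ValueError("genotype lists must have the same length")
    n = len(tag_genotypes)
    count_x = Counter(tag_genotypes)
    count_y = Counter(target_genotypes)
    count_xy = Counter(zip(tag_genotypes, target_genotypes))
    mi = 0.0
    # Only genotype pairs that were actually observed contribute,
    # so p(x, y) is never zero inside the logarithm.
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi

tag = [0, 0, 1, 1, 2, 2]
print(mutual_information(tag, [0, 0, 1, 1, 2, 2]))  # identical genotypes
print(mutual_information(tag, [1, 1, 1, 1, 1, 1]))  # uninformative target
```

When haplotypes have been determined, the same function could be applied to alleles per haplotype instead of genotypes per individual, in line with the replacement described above.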
Summing the values of the "mutual information between a Tag SNP candidate and its Target SNPs" calculated in this way gives the essential element of the index used in the selection method of the present invention.
Then, b) from among all Tag SNP candidates, those with a large sum of mutual information are selected, in descending order of that sum, as the SNPs to be detected by the nucleic acid probes used in the above means for imputation. The selection method of the present invention can be carried out in this way.
As described above, in the selection method of the present invention it is preferable, for efficient Tag SNP selection, that the Target SNP group be narrowed down in advance using an index other than the mutual information. The "r² linkage disequilibrium value" (R-squared value, or R^2), which is particularly suitable for this purpose, is the square of the Pearson correlation coefficient of the genotype frequencies of two SNPs; it takes a value from 0 to 1, and a value closer to 1 indicates stronger linkage disequilibrium. When the haplotypes of the SNP group have been determined, however, the genotype frequencies may be replaced with the frequencies of the alleles constituting the genotypes, and the frequency with which genotypes are observed together at the two SNPs may be replaced with the haplotype frequency.
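The r² value can be sketched in the same way. The example below assumes that genotypes are coded as the number of copies of one allele (0, 1 or 2) per individual and computes the squared Pearson correlation of those codes; the function name `r_squared` and the convention of returning 0.0 for a monomorphic SNP are assumptions of this illustration.

```python
def r_squared(dosage_a, dosage_b):
    """Squared Pearson correlation between two SNPs.

    dosage_a, dosage_b: per-individual allele counts (0, 1 or 2),
    listed in the same individual order (assumed coding).
    """
    n = len(dosage_a)
    if n == 0 or n != len(dosage_b):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_a = sum(dosage_a) / n
    mean_b = sum(dosage_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosage_a, dosage_b))
    var_a = sum((a - mean_a) ** 2 for a in dosage_a)
    var_b = sum((b - mean_b) ** 2 for b in dosage_b)
    if var_a == 0 or var_b == 0:
        # A SNP with no variation carries no correlation information.
        return 0.0
    return cov * cov / (var_a * var_b)

print(r_squared([0, 1, 2, 1], [0, 1, 2, 1]))  # perfectly correlated SNPs
print(r_squared([0, 1, 2, 0], [0, 1, 1, 0]))  # partially correlated SNPs
```

A Target SNP would then be kept for a given Tag SNP candidate only if this value reaches the chosen threshold.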
Pre-selecting, by a linkage disequilibrium value such as the r² linkage disequilibrium value, the Target SNPs whose linkage disequilibrium exceeds a certain level allows the selection method of the present invention to be carried out efficiently. The threshold for the r² linkage disequilibrium value, the "predetermined neighborhood", the "number of Tag SNPs to be selected", and the "incorporation of other SNPs" have all been described above.
(2) Computer system and computer program of the present invention
The computer system of the present invention is a system that serves as the means for carrying out the selection method of the present invention described above, and the program of the present invention is a computer program containing an algorithm that causes the computer system of the present invention to carry out the selection method of the present invention. As in the general usage of the computer field, "algorithm" means a formalized expression of a procedure for solving a problem.
The computer system of the present invention can include the hardware of an ordinary computer system. That is, it includes a "recording unit", usually a hard disk drive, and an "arithmetic processing unit" corresponding to a CPU, as well as, for example, a "temporary storage unit" corresponding to RAM; an "operation unit" corresponding to a keyboard, mouse, touch panel, or the like; a "display unit" corresponding to a display; an "input/output interface (IF) unit" corresponding to a serial or parallel interface or the like suited to the operation unit; and a "communication interface (IF) unit" that includes a video memory and a D/A converter and outputs an analog signal matching the video format of the display unit. Through the communication IF unit, data can be exchanged with external information, in particular human genome information such as a human genome database.
Unless otherwise noted, the processing below is described as processing performed by the "arithmetic processing unit" of the computer system of the present invention. When the "operation unit" is operated, the "arithmetic processing unit" acquires data, in particular from a human genome database, via the "communication IF unit" and records it in the "recording unit"; it reads data from the "recording unit" into the "temporary storage unit" as needed, performs the prescribed processing, and records the result in the "recording unit" again. The "arithmetic processing unit" also creates screen data prompting operation of the "operation unit" and screen data displaying processing results, and displays these images on the "display unit" via the video RAM of the input IF unit. The program of the present invention is recorded in the "recording unit", either at the time of use or in advance, or on an external hardware resource, and the "arithmetic processing unit" performs computation according to the algorithm described in it as needed.
FIG. 1 is a flowchart outlining the content of the program of the present invention, and FIG. 2 is a flowchart that expresses FIG. 1 more concretely. Step S1 is common to FIGS. 1 and 2 and is the step of "reading the Target SNPs, the Tag SNP candidates, and the genotypes at their loci from an input file containing, for each SNP, its site (chromosome, position) and the genotype of each individual". In the examples described later, the input file used as an example of human genome information was a reference panel, namely a file consisting of the chromosomal sites at which variants were found in the whole-genome data of 1,070 Japanese individuals determined with NGS (next-generation sequencing) at the Tohoku Medical Megabank Organization (ToMMo).
Step S1 describes the first function of the program of the present invention. That is, step S1 describes the "first function" of reading, for processing in the arithmetic processing unit, the following information (a) to (d) from the recording unit in which it is recorded, for human genome information containing the genotypes of a plurality of individuals:
 (a) the locus on the human genome of each Tag SNP candidate;
 (b) the genotype of each Tag SNP candidate in each individual's human genome information;
 (c) the locus on the human genome of each Target SNP; and
 (d) the genotype of each Target SNP in each individual's human genome information.
As described above, a step for preferentially incorporating "other SNPs" can be provided before step S1. In that case, it is preferable to also provide a step of removing those other SNPs from the Tag SNP candidates. This pre-incorporation step is preferably provided as an alternative to the post-incorporation step described later.
Step S1' shown in FIG. 2 represents the initial state of the Tag SNPs and Target SNPs to be selected subsequently. In step S1', "s" denotes the number of Tag SNPs selected so far; at this point "s = 0", that is, no Tag SNP has yet been selected. Correspondingly, "S = [0, …, 0]" indicates that no Tag SNP candidate has been selected: the number of zeros in the brackets [ ] is the number of SNPs to be examined, and an entry that becomes 1 indicates that the corresponding SNP has been selected. "T = [0, …, 0]" expresses the same thing with "Target SNP" in place of "Tag SNP candidate".
Step S2 in FIG. 1 is the step of "calculating a score for every unselected Tag SNP candidate" using the human genome information read from the recording unit in step S1. Step S2 describes the first half of the second function of the program of the present invention. Steps S2-1(1), S2-2, S2-3(1), S2-4, S2-5, S2-3(2), and S2-1(2) in FIG. 2 correspond to step S2 in FIG. 1 and are referred to collectively as "step S2". Steps S2-1(1)/(2) and steps S2-3(1)/(2) each form a pair of loop ends.
Step S2 describes the function of calculating, from the information (a) to (d) read by the first function, the sum of the mutual information between each Tag SNP candidate and its corresponding Target SNPs, and using that sum as the score. Mutual information is the information-theoretic quantity obtained by the calculation described above. Its calculation requires not only the genotype frequencies of each Tag SNP candidate, but also, for each Target SNP within the predetermined neighborhood of that candidate's locus, the frequencies of the genotype combinations of the Tag SNP candidate and the Target SNP. These frequencies are preferably calculated in step S2.
This example also shows a preferred embodiment in which the Target SNPs for which mutual information is calculated with respect to each Tag SNP are narrowed down by a threshold that sets a lower limit on the r² linkage disequilibrium value (R^2). The method for calculating the r² linkage disequilibrium value and the preferred range of the threshold are as described above; in the examples described later, "r² > 0.8" was used as the threshold.
Step S2-1(1) in FIG. 2 is the opening loop end that selects each of the M Tag SNP candidates "i" in turn. "Score: I(i) = 0" in step S2-2 initializes the score of the Tag SNP candidate "i" selected in step S2-1(1). Step S2-3(1) is the opening loop end that selects each of the N Target SNPs "j" in turn.
Step S2-4 decides whether a score is to be calculated for the pair consisting of Tag SNP candidate "i" and the Target SNP "j" examined together with it. "L[i,j] <= L0" means that the genomic distance (bp) L[i,j] between Tag SNP candidate "i" and Target SNP "j" is no greater than the specified value L0; that is, L0 is the distance defining the predetermined neighborhood of the Tag SNP candidate's locus, as described above. "R[i,j] >= R0" means that the r² linkage disequilibrium value between Tag SNP candidate "i" and Target SNP "j" is at least the threshold R0, which is also as described above. T[j] is 1 if Target SNP "j" is already covered by one or more Tag SNPs and 0 if it is not; T[j] = 0 therefore means that Target SNP "j" has not yet been covered. If the conditions in the decision box are satisfied ("Yes"), processing proceeds to step S2-5; otherwise ("No"), it returns to step S2-3(1) and moves on to the next Target SNP.
Step S2-5 calculates the score when the decision in step S2-4 is "Yes" and adds the value to the running total for Tag SNP candidate "i". As described above, the "score" is the mutual information between Tag SNP candidate "i" and the Target SNP "j" paired with and covered by it.
Step S2-3(2) is the closing end of the loop opened at step S2-3(1), which selects the Target SNPs, and step S2-1(2) is the closing end of the loop opened at step S2-1(1), which selects the Tag SNP candidates. These loops advance the pair of Tag SNP candidate and Target SNP under examination.
Step S3 in FIG. 1 is the step of "selecting the one Tag SNP candidate with the largest score calculated in step S2". Step S3 describes the second half of the second function of the program of the present invention and corresponds to steps S3-1, S3-2(1), S3-3, and S3-2(2) in FIG. 2. Steps S3-2(1)/(2) form a pair of loop ends.
In step S3-1, the index of the Tag SNP candidate with the largest score calculated in step S2 is set to "k", and the corresponding entry of the array S described above is set to "1" to mark it as a Tag SNP to be selected. Step S3-2(1) is the opening loop end for recording, over all Target SNPs (j = 1, …, N), which of them are covered by Tag SNP "k", the candidate with the largest score. Step S3-3 decides whether the update to T[j] = 1 in step S3-4 is to be performed. If the r² linkage disequilibrium value between Tag SNP "k" and a Target SNP "j" in the corresponding Target SNP group is at least the threshold R0, the decision is "yes" and processing proceeds to step S3-4, where Target SNP "j" is confirmed as covered by Tag SNP "k" and T[j] is updated to 1. At step S3-2(2), the closing end of the loop opened at step S3-2(1), processing returns to step S3-2(1) and the next Target SNP is checked. When all Target SNPs in the group have been checked, the loop ends and processing proceeds to step S4.
If, on the other hand, the r² linkage disequilibrium value for Target SNP "j" is smaller than the threshold R0, the decision in step S3-3 is "no": processing returns to step S3-2(1), Target SNP "j" is not recorded as covered, and the same check is performed for the next Target SNP.
Step S4 is common to FIGS. 1 and 2 and is the step of "determining whether the total number of selected Tag SNPs has reached the planned number". In FIG. 2, this decision is expressed with the number to be mounted written as "S0". Step S4 describes the third function of the program of the present invention. That is, starting from the Tag SNP information and Target SNP information from which the information on the Target SNP group selected in steps S2 and S3 (the second function) has been removed, steps S2 and S3 are executed again to select the Tag SNP with the largest sum of mutual information (in this example, with pre-selection by the threshold on the r² linkage disequilibrium value, as noted above) as the second Tag SNP. Steps S2 and S3 are then repeated until the "planned number for the means for imputation, such as a DNA microarray for SNP detection" is reached.
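The loop formed by steps S1', S2, S3 and S4 can be sketched as follows. This is an illustrative reading of the flowchart description, not the program of the invention itself: the function name `select_tag_snps` and the assumption that the distances L[i][j], the r² values R[i][j] and the mutual information MI[i][j] have been precomputed as M x N nested lists are choices made for this example. As in the description of step S3-3, coverage is updated using the r² threshold only.

```python
def select_tag_snps(L, R, MI, L0, R0, S0):
    """Greedy Tag SNP selection (sketch of steps S1' to S4).

    L[i][j]:  genomic distance (bp) between candidate i and target j
    R[i][j]:  r-squared between candidate i and target j
    MI[i][j]: mutual information between candidate i and target j
    L0, R0:   neighborhood size and r-squared threshold
    S0:       planned number of Tag SNPs
    """
    M = len(MI)
    N = len(MI[0]) if M else 0
    S = [0] * M   # step S1': S[i] becomes 1 once candidate i is selected
    T = [0] * N   # step S1': T[j] becomes 1 once target j is covered
    selected = []
    while len(selected) < S0:              # step S4
        best_i, best_score = None, 0.0
        for i in range(M):                 # step S2: score each candidate
            if S[i]:
                continue
            score = 0.0
            for j in range(N):
                # Step S2-4: nearby, in strong LD, and not yet covered.
                if L[i][j] <= L0 and R[i][j] >= R0 and T[j] == 0:
                    score += MI[i][j]      # step S2-5
            if best_i is None or score > best_score:
                best_i, best_score = i, score
        if best_i is None:                 # no unselected candidates remain
            break
        k = best_i                         # step S3-1
        S[k] = 1
        selected.append(k)
        for j in range(N):                 # steps S3-2 to S3-4
            if R[k][j] >= R0:
                T[j] = 1
    return selected, T

# Toy data: 3 candidates, 4 targets, all within the neighborhood.
L = [[0, 0, 0, 0]] * 3
R = [[1.0, 0.9, 0.1, 0.1], [0.9, 1.0, 0.1, 0.1], [0.1, 0.1, 1.0, 0.9]]
MI = [[1.0, 0.8, 0.0, 0.0], [0.7, 0.9, 0.0, 0.0], [0.0, 0.0, 0.9, 0.6]]
selected, covered = select_tag_snps(L, R, MI, L0=10_000, R0=0.8, S0=2)
print(selected, covered)  # [0, 2] [1, 1, 1, 1]
```

In the toy data, candidate 1 is passed over in the second round because the targets it would cover are already covered by candidate 0, which is the effect of the T[j] = 0 condition in step S2-4.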
As described above, a step for preferentially incorporating "other SNPs" can be provided after step S4. In that case, it is preferable to also provide a step of removing the already selected Tag SNPs from those other SNPs. This post-incorporation step is preferably provided as an alternative to the pre-incorporation step described above.
The program of the present invention can be written in, for example, C, Java (registered trademark), Perl, or Python, and can also be made multi-platform.
Furthermore, the program of the present invention can be stored on a computer-readable recording medium or a recording medium connectable to a computer, and such recording media are also provided as the storage medium of the present invention. Examples include, without particular limitation, magnetic media such as flexible disks and hard disks; flash memory; optical media such as CDs, DVDs, and BDs; and magneto-optical media such as MOs and MDs.
(3) Array of the present invention
The array of the present invention can be produced by mounting nucleic acid probes (second step) corresponding to the Tag SNP information selected with the selection method or computer system of the present invention described above (first step); that is, by (a) a first step of selecting Tag SNPs according to the selection method of the present invention, and (b) a second step of mounting on a DNA microarray nucleic acid probes for detecting, on the basis of the Tag SNPs selected in the first step, the genotypes of those Tag SNPs in the human genome in a specimen. A wide range of known methods can be used for the second step, and new means of producing DNA microarrays that become available in the future can also be used as long as they do not impair the effects of the present invention.
Nucleic acid probes can be prepared, for example, by obtaining the DNA fragment on which the probe is based through a gene amplification method such as PCR or RNA-PCR using appropriate amplification primers for the human genome sequence containing the SNP base of interest, or through chemical DNA synthesis. The length of the DNA fragment is not particularly limited, but is preferably 10 to 100 bases, more preferably 10 to 40 bases. A longer fragment increases the probe's ability to capture the target nucleic acid containing the SNP base but tends to be unsuitable for high-density DNA microarrays, whereas a shorter fragment tends to capture the target nucleic acid less well. The length of the nucleic acid probes mounted on the DNA microarray can be designed with these advantages and disadvantages in mind. The DNA fragment may be modified for use as a nucleic acid probe, and known modification methods can be used. Substances used in this field, such as various fluorescent dyes and chromogenic dyes, may be used for the modification as appropriate, without limitation.
 上記のようにして、本発明に基づいて選択されたTagSNPをターゲットとして、検体由来のDNA試料と接触させることにより捕捉して捕捉シグナルをDNAマイクロアレイ上において発生させることが可能な核酸プローブが調製される。 As described above, a nucleic acid probe capable of generating a capture signal on a DNA microarray by capturing a tag SNP selected based on the present invention as a target and contacting with a sample-derived DNA sample is prepared. The
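 As a purely illustrative sketch of the probe-length consideration described above, a window containing the SNP base could be cut from a reference sequence as follows. The function name `probe_window`, the 0-based `snp_index`, the default length of 25 bases, and the centring rule are assumptions made for illustration and are not taken from this specification; only the 10 to 100 base range comes from the text.

```python
def probe_window(reference: str, snp_index: int, length: int = 25) -> str:
    """Return a `length`-base window of `reference` that contains the SNP base.

    `snp_index` is 0-based. The window is centred on the SNP where possible
    and shifted inward when the SNP lies near either end of the sequence.
    """
    # The 10-100 base range follows the preferred range stated in the text;
    # enforcing it as a hard limit is an assumption of this sketch.
    if not 10 <= length <= 100:
        raise ValueError("probe length must be between 10 and 100 bases")
    if not 0 <= snp_index < len(reference):
        raise ValueError("SNP index lies outside the reference sequence")
    if length > len(reference):
        raise ValueError("reference sequence is shorter than the probe")
    start = snp_index - length // 2
    # Clamp the window so that it stays inside the reference sequence.
    start = max(0, min(start, len(reference) - length))
    return reference[start:start + length]
```

 A shorter `length` corresponds to the high-density end of the trade-off, a longer one to the higher-capture end.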
 A DNA microarray carrying the desired nucleic acid probes can be produced by attaching and immobilizing the nucleic acid probes prepared in advance in this way on a carrier. Examples of the carrier include solid-phase carriers made of materials such as glass, plastic (for example, polypropylene or nylon), polyacrylamide, nitrocellulose, gel, and other porous or non-porous materials.
 Examples of methods for attaching nucleic acid probes to the carrier surface include printing onto a plate. Techniques for producing high-density arrays include generating, in situ by photolithographic synthesis, an array containing thousands of oligonucleotides complementary to defined sequences at defined positions on the surface, and rapidly synthesizing predesigned DNA strands and attaching them directly to the carrier; DNA microarrays can also be produced using masking techniques. In addition, arrays can be manufactured with an inkjet printing apparatus for oligonucleotide synthesis, and DNA microarrays that use fluorescent beads or magnetic beads can also be produced.
 By making full use of these techniques, a DNA microarray capable of detecting the TagSNPs selected according to the present invention can be produced. Besides in-house production, the array can also be obtained as a "commercial product", for example by commissioning a company that undertakes contract production of microarrays.
 When the array of the present invention produced in this way is brought into contact with a DNA sample, the presence of base substitutions at the TagSNPs selected according to the present invention in that sample is detected as the signal of each spot, so that each SNP can be identified, including whether it is homozygous or heterozygous. By integrating and organizing the results and performing imputation, it is possible to estimate information on TargetSNPs that are not mounted on the DNA microarray, that is, SNPs other than the TagSNPs, and this information can be used for purposes such as managing the subject's health. The DNA sample used is not particularly limited as long as human genomic DNA can be obtained from it, even in trace amounts; examples include blood, saliva, urine, feces, sweat, nails, hair, skin, oral tissue, semen, cerebrospinal fluid, and lymph. A DNA sample can be obtained by purifying the genomic DNA from these raw specimens.
 Examples of the present invention are disclosed below.
[Example 1] Selection of TagSNPs
 As described above, the computer program shown in FIG. 1 was run on a file composed of information on the chromosomal sites at which variants were found in the whole-genome data files of 1,070 Japanese individuals, determined using NGS (next-generation sequencer) at the Tohoku Medical Megabank Organization (ToMMo), in order to select the TagSNPs to be included in the nucleic acid probes mounted on a DNA microarray.
 Here, the selection method of the present invention was carried out with an r² linkage disequilibrium threshold of r² > 0.8 for the preliminary narrowing down of TagSNP candidates, and with the "neighborhood defined within a certain range" set to ±500 kbp from the locus of each TagSNP candidate. The number of TagSNPs used for the nucleic acid probes to be mounted on the DNA microarray was 675,000. The TagSNP candidates and TargetSNPs in this example were selected from a group of about 9.4 million SNPs that had previously been analyzed on Affymetrix DNA microarrays, but such a preselection is not essential. For example, the selection method of the present invention can also be carried out by randomly designating a hypothetical TagSNP group and TargetSNP group from an arbitrary group of SNPs. Excluding SNPs with a low MAF from the TagSNP candidates in advance is also an efficient measure. Furthermore, the selection method of the present invention can also be carried out starting from an existing list of TagSNPs.
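 The narrowing step with r² > 0.8 and a ±500 kbp neighborhood can be sketched as follows. This is a minimal illustration under stated assumptions, not the program of FIG. 1: genotypes are assumed to be coded as 0/1/2 allele counts, r² is approximated by the squared Pearson correlation of those counts rather than a haplotype-based estimate, and the names `r_squared`, `neighbours`, `window_bp`, and `r2_min` are hypothetical.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation of two equal-length genotype vectors (0/1/2)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)

def neighbours(tag, snps, window_bp=500_000, r2_min=0.8):
    """IDs of SNPs within +/- window_bp of `tag` whose r^2 with it exceeds r2_min.

    `snps` maps SNP id -> (position_bp, genotype_list); all SNPs are assumed
    to lie on the same chromosome.
    """
    tag_pos, tag_gt = snps[tag]
    return [
        sid for sid, (pos, gt) in snps.items()
        if sid != tag
        and abs(pos - tag_pos) <= window_bp
        and r_squared(tag_gt, gt) > r2_min
    ]
```

 With this filter, a SNP in perfect LD with the candidate but 900 kbp away is excluded by the window, and a nearby but uncorrelated SNP is excluded by the threshold.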
 In this example, the 675,000 TagSNPs selected as described above were evaluated by imputing the SNP genotypes of 131 Japanese individuals distinct from the 1,070 individuals mentioned above. First, NGS was used to identify the SNP loci and the genotype of each of the 131 individuals, and from these data the genotype information for the loci corresponding to the 675,000 TagSNPs selected in this example was extracted. Identifying the genotypes of the TagSNPs from the NGS results in this way stands in for identifying them with a DNA microarray. Next, from the genotypes of the 131 individuals at these TagSNPs, the SNP genotypes of the 131 individuals were estimated with reference to the human genome information of the 1,070 individuals (imputation). To evaluate the estimates, the square of the correlation coefficient (r²) between the genotypes of the 131 individuals estimated by imputation and the genotypes identified by NGS was calculated. If the estimated results agree completely with the experimentally identified results (NGS, etc.) for all 131 individuals, r² is 1.0, meaning that the true genotypes were estimated perfectly; conversely, the more samples whose estimated genotype differs from the true genotype, the smaller r² becomes. To evaluate the TagSNP selection, the r² values calculated in this way were averaged for each MAF class of the SNPs being estimated. The resulting mean r² was 0.81 for SNPs with an MAF of 1 to 3%, 0.88 for SNPs with an MAF of 3 to 5%, and 0.96 for SNPs with an MAF of 5% or more, indicating very good imputation performance.
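 The evaluation described above (r² between imputed and sequenced genotypes, averaged per MAF class) can be sketched as follows. This is an illustrative sketch only: the 0/1/2 genotype coding, computing MAF from the sequenced genotypes of the evaluation samples, the half-open class boundaries, and the names `maf`, `maf_class`, and `mean_r2_by_maf` are assumptions that the specification does not state.

```python
from collections import defaultdict
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between true and imputed genotype vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def maf(genotypes):
    """Minor allele frequency from 0/1/2 allele counts of diploid samples."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)

def maf_class(value):
    """Assign an MAF value to one of the three classes used in the example."""
    if value >= 0.05:
        return "MAF >= 5%"
    if value >= 0.03:
        return "MAF 3-5%"
    if value >= 0.01:
        return "MAF 1-3%"
    return None  # SNPs below 1% are not reported in the example

def mean_r2_by_maf(truth, imputed):
    """truth, imputed: dict SNP id -> genotype list over the same samples."""
    classes = defaultdict(list)
    for sid, true_gt in truth.items():
        label = maf_class(maf(true_gt))
        if label is not None:
            classes[label].append(r_squared(true_gt, imputed[sid]))
    return {label: mean(values) for label, values in classes.items()}
```

 A perfectly imputed SNP contributes r² = 1.0 to its class, and each sample imputed incorrectly pulls the value for that SNP down, as the paragraph above describes.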
 The 675,000 TagSNPs described above are disclosed in Example 4 (Examples 4-1 and 4-2) of Japanese Patent Application No. 2014-223834.
[Example 2] Comparison with an existing commercial DNA microarray (1)
 For comparison with the example above, the SNP genotypes of the same 131 Japanese individuals were estimated by imputation using the SNPs mounted on an existing commercial DNA microarray. Imputation using the SNP information of Illumina's Human Omni 2.5-8 (hereinafter also simply "OMNI2.5") gave a mean r² of 0.80 for SNPs with an MAF of 1 to 3%, 0.87 for SNPs with an MAF of 3 to 5%, and 0.96 for SNPs with an MAF of 5% or more. This imputation performance is almost equivalent to that of the example above, but the commercial DNA microarray carries about 2.3 million SNPs (2,338,671 to be exact), far more than the 675,000 of the example above. In other words, imputation using the TagSNPs selected by the method of the example above has the major advantage of estimating SNP genotypes far more efficiently than imputation using the SNPs mounted on the existing commercial DNA microarray.
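 As a rough check of this efficiency comparison, using only the two SNP counts stated above (the symbols N are introduced here for illustration):

```latex
\[
\frac{N_{\text{OMNI2.5}}}{N_{\text{TagSNP}}}
  = \frac{2{,}338{,}671}{675{,}000} \approx 3.46,
\qquad
\frac{N_{\text{TagSNP}}}{N_{\text{OMNI2.5}}} \approx 0.289 .
\]
```

 That is, comparable mean r² values are reached with roughly 29% of the number of mounted SNPs.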
[Example 3] Comparison with an existing commercial DNA microarray (2)
 Next, to verify how much imputation performance is retained with fewer mounted SNPs than the 675,000 above, TagSNP counts of 300,000, 400,000, 500,000, and 600,000 were examined in addition to 675,000, for MAF classes of 1 to 3%, 3 to 5%, and 5% or more. The results are shown in Table 1. The TagSNPs used here are specifically disclosed in Japanese Patent Application No. 2014-223834: the 300,000 in Example 4-1, the 400,000 in Example 4-2-1, the 500,000 in Example 4-2-2, the 600,000 in Example 4-2-3, and the 675,000 in Example 4-2-4.
Table 1 (Figure JPOXMLDOC01-appb-T000002)
 The following can be seen from the results in Table 1.
1. Judging from the relative values in Table 1, a DNA microarray carrying 500,000 or more probes obtained according to the present invention achieves imputation performance equal to or better than that of OMNI2.5.
2. Even when the number of probes obtained according to the present invention is further reduced to 400,000, performance almost equivalent to that of OMNI2.5 is obtained.
3. Even when the number of probes obtained according to the present invention is reduced still further to 300,000, performance is slightly inferior to that of OMNI2.5 but still close to equivalent, and the basic performance of the DNA microarray described above is maintained.
 From the above, it became clear that, by designing a DNA microarray with probes obtained according to the present invention, a DNA microarray with performance almost equivalent to that of OMNI2.5 can be designed even when the number of mounted probes is reduced to nearly one tenth of the approximately 2.3 million probes mounted on OMNI2.5.
S1: Step describing the first function of the program of the present invention
S1': Step describing the initial state of the TagSNPs and TargetSNPs selected from S1 onward
S2: Step describing the first half of the second function of the program of the present invention
S2-1(1): Step describing the opening end of the first loop in S2
S2-2: Step describing the initialization of the TagSNP candidate
S2-3(1): Step describing the opening end of the second loop in S2
S2-4: Step describing the determination of whether to calculate a score
S2-5: Step describing the addition of the score of the TagSNP for which a score was calculated
S2-3(2): Step describing the closing end of the loop opened in S2-3(1)
S2-1(2): Step describing the closing end of the loop opened in S2-1(1)
S3: Step describing the selection of the single TagSNP candidate with the largest score calculated in S2
S3-1: Step describing the number of the TagSNP candidate with the largest score
S3-2(1): Step describing the opening end of the loop in S3
S3-3: Step describing the determination of whether to perform the update described in the next step
S3-4: Step describing the function that performs the update
S3-2(2): Step describing the closing end of the loop opened in S3-2(1)
S4: Step describing the determination of whether the number of selected TagSNP candidates has reached the planned number
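 The flow of steps S1 to S4 can be sketched as a greedy loop. This is a minimal sketch under stated assumptions, not the program of FIG. 1: mutual information is computed from empirical genotype frequencies, `targets_of` is assumed to hold, for each TagSNP candidate, the TargetSNPs already narrowed down by neighborhood and linkage disequilibrium, covered TargetSNPs are assumed to stay eligible as candidates, and ties are broken by SNP id. The names `mutual_information`, `select_tag_snps`, `genotypes`, `targets_of`, and `n_tags` are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """Mutual information (bits) between two genotype vectors, computed from
    the empirical marginal and joint genotype frequencies."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts.
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )

def select_tag_snps(genotypes, targets_of, n_tags):
    """Greedily pick up to n_tags TagSNPs.

    genotypes:  dict SNP id -> genotype list (same samples for every SNP)
    targets_of: dict candidate id -> list of its TargetSNP ids
    """
    selected = []                    # S1': no TagSNP selected yet
    removed = set()                  # S1': no TargetSNP removed yet
    candidates = set(targets_of)
    while len(selected) < n_tags and candidates:         # S4
        best_id, best_score = None, -1.0
        for cand in sorted(candidates):                  # S2-1
            score = 0.0                                  # S2-2
            for tgt in targets_of[cand]:                 # S2-3
                if tgt in removed:                       # S2-4
                    continue
                score += mutual_information(             # S2-5
                    genotypes[cand], genotypes[tgt])
            if score > best_score:                       # S3
                best_id, best_score = cand, score
        selected.append(best_id)                         # S3-1
        candidates.discard(best_id)
        removed.add(best_id)
        for tgt in targets_of[best_id]:                  # S3-2
            if tgt not in removed:                       # S3-3
                removed.add(tgt)                         # S3-4
    return selected
```

 Because the TargetSNPs of each selected TagSNP are removed before the next round, later rounds favor candidates that inform SNPs not yet covered, rather than repeatedly selecting TagSNPs from the same region.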

Claims (38)

  1.  A method for selecting TagSNPs, using human genome information that includes information on a group of SNPs whose genotypes have been identified in a plurality of individuals, in order to construct a group of nucleic acid probes corresponding to TagSNPs used as a means for imputing SNP information of the human genome, the method being characterized by:
     a) taking the group of SNPs in the human genome information as the population, treating SNPs located in a neighborhood defined within a certain range from the locus of each SNP serving as a TagSNP candidate as TargetSNPs, and calculating the sum of the mutual information between the TagSNP candidate and these TargetSNPs; and
     b) selecting, from among all TagSNP candidates and in descending order of that sum, TagSNP candidates with a large sum of mutual information as the TagSNPs to be represented in the nucleic acid probes used as the means for imputation.
  2.  The method for selecting TagSNPs according to claim 1, characterized in that the human genome information is human genome database information that includes information on a group of SNPs whose genotypes have been identified in a plurality of individuals.
  3.  The method for selecting TagSNPs according to claim 1 or 2, characterized in that, for each of the TagSNP candidates, the group of TargetSNPs used to calculate the sum of mutual information is narrowed down in advance using an index other than mutual information.
  4.  The method for selecting TagSNPs according to claim 3, characterized in that the index other than mutual information is a linkage disequilibrium value with the group of TargetSNPs located in the neighborhood defined within a certain range from the TagSNP candidate.
  5.  The method for selecting TagSNPs according to claim 4, characterized in that the linkage disequilibrium value is an r² linkage disequilibrium value.
  6.  The method for selecting TagSNPs according to any one of claims 1 to 5, characterized in that the neighborhood defined within a certain range is within 500 kbp upstream and within 500 kbp downstream of the TagSNP base.
  7.  The method for selecting TagSNPs according to any one of claims 1 to 6, characterized in that the number of TagSNPs selected for the nucleic acid probes used as the means for imputation is at least the number at which the result of imputation by that means satisfies a predetermined performance.
  8.  The method for selecting TagSNPs according to claim 7, characterized in that the predetermined performance is that the mean of the squared correlation coefficients between the genotypes of SNPs with an MAF of 5% estimated by imputation and the actual genotypes is 0.94 or more.
  9.  The method for selecting TagSNPs according to any one of claims 1 to 8, characterized in that the human genome information is derived from a human population belonging to a specific race or to a category smaller than a race.
  10.  The method for selecting TagSNPs according to any one of claims 1 to 9, characterized in that one or more other SNPs are selected separately from the selection of TagSNPs by the selection method, and these other SNPs are included in preference to the TagSNPs.
  11.  The method for selecting TagSNPs according to any one of claims 1 to 10, characterized in that the group of nucleic acid probes is a group of nucleic acid probes to be mounted on a DNA microarray.
  12.  A DNA microarray characterized by carrying nucleic acid probes corresponding to TagSNPs selected according to the method for selecting TagSNPs according to claim 11.
  13.  A method for producing a DNA microarray, characterized by comprising the following steps (1) and (2):
     (1) a first step of selecting TagSNPs according to the selection method according to claim 11;
     (2) a second step of mounting, on a DNA microarray, nucleic acid probes for detecting the genotype of the TagSNPs in the human genome in a sample, based on the TagSNPs selected in the first step.
  14.  A computer system for selecting TagSNPs, using human genome information that includes information on a group of SNPs whose genotypes have been identified in a plurality of individuals, in order to construct a group of nucleic acid probes corresponding to TagSNPs used as a means for imputing SNP information of the human genome, the computer system comprising a recording unit and an arithmetic processing unit, and being characterized in that:
    (A) the recording unit records at least the following, as information on TagSNP candidates read from the human genome information and, as TargetSNP information, information on SNPs located in a neighborhood defined within a certain range from the loci of those TagSNP candidates:
     (1) the locus of each TagSNP candidate on the human genome,
     (2) the genotype of each TagSNP candidate in each individual's human genome information,
     (3) the locus of each TargetSNP on the human genome, and
     (4) the genotype of each TargetSNP in each individual's human genome information;
    (B) the arithmetic processing unit calculates, based on the information (1) to (4) of (A) from the recording unit, the sum of the mutual information between each TagSNP candidate and its corresponding TargetSNPs, and selects the TargetSNP candidate with the largest sum as the first TagSNP;
    (C) based on the TagSNP information and TargetSNP information from which the information on the TagSNPs selected so far and their corresponding TargetSNP groups has been removed, the TagSNP candidate with the largest sum of mutual information is again selected by step (B) as the second TagSNP; and
    (D) steps (B) and (C) are repeated to select the M-th TagSNP (M being a natural number), the repetition being performed the remaining M-2 times until M reaches the planned number of nucleic acid probes used as the means for imputation.
  15.  The computer system for selecting TagSNPs according to claim 14, characterized in that the human genome information is human genome database information that includes information on a group of SNPs whose genotypes have been identified in a plurality of individuals.
  16.  The computer system for selecting TagSNPs according to claim 14 or 15, characterized in that, when the arithmetic processing unit calculates mutual information, the genotypes of the SNPs concerned are determined, and (1) the frequency of each genotype of each TagSNP candidate, (2) the frequency of each genotype of each TargetSNP located in the neighborhood defined within a certain range from the locus of each TagSNP candidate, and (3) the frequency of each combination of genotypes of the TagSNP candidate and the TargetSNP candidate are calculated.
  17.  The computer system for selecting TagSNPs according to any one of claims 14 to 16, characterized in that, for each of the TagSNP candidates, the group of TargetSNPs used to calculate the sum of mutual information is narrowed down in advance using an index other than mutual information.
  18.  The computer system for selecting TagSNPs according to claim 17, characterized in that the index other than mutual information is a linkage disequilibrium value with the group of TargetSNPs located in the neighborhood defined within a certain range from the TagSNP candidate.
  19.  The computer system for selecting TagSNPs according to claim 18, characterized in that the linkage disequilibrium value is an r² linkage disequilibrium value.
  20.  The computer system for selecting TagSNPs according to any one of claims 14 to 19, characterized in that the neighborhood defined within a certain range is within 500 kbp upstream and within 500 kbp downstream of the TagSNP base.
  21.  The computer system for selecting TagSNPs according to any one of claims 14 to 20, characterized in that the number of TagSNPs selected for the nucleic acid probes used as the means for imputation is at least the number at which the result of imputation by that means satisfies a predetermined performance.
  22.  The computer system for selecting TagSNPs according to claim 21, characterized in that the predetermined performance is that the mean of the squared correlation coefficients between the genotypes of SNPs with an MAF of 5% estimated by imputation and the actual genotypes is 0.94 or more.
  23.  The computer system for selecting TagSNPs according to any one of claims 14 to 22, characterized in that one or more other SNPs are selected separately from the selection of TagSNPs in the computer system, and these other SNPs are included preferentially as SNPs that should characterize the nucleic acid probes.
  24.  The computer system for selecting TagSNPs according to any one of claims 14 to 23, characterized in that the group of nucleic acid probes is a group of nucleic acid probes to be mounted on a DNA microarray.
  25.  A computer program for selecting TagSNPs, using human genome information that includes information on a group of SNPs whose genotypes have been identified in a plurality of individuals, in order to construct a group of nucleic acid probes corresponding to TagSNPs used as a means for imputing SNP information of the human genome, the computer program being characterized by including an algorithm that causes a computer to realize:
    (A) a first function of reading out, for processing in an arithmetic processing unit, the following information (1) to (4) from a recording unit in which it is recorded as information on TagSNP candidates read from the human genome information and, as TargetSNP information, information on SNPs located in a neighborhood defined within a certain range from the loci of those TagSNP candidates:
     (1) the locus of each TagSNP candidate on the human genome,
     (2) the genotype of each TagSNP candidate in each individual's human genome information,
     (3) the locus of each TargetSNP on the human genome, and
     (4) the genotype of each TargetSNP in each individual's human genome information;
    (B) a second function of calculating, based on the information (1) to (4) read out by the first function, the sum of the mutual information between each TagSNP candidate and its corresponding TargetSNPs, and selecting the TargetSNP candidate with the largest sum as the first TagSNP; and
    (C) a third function of again selecting, by the second function and based on the TagSNP information and TargetSNP information from which the information on the TagSNPs selected so far and their corresponding TargetSNP groups has been removed, the TagSNP candidate with the largest sum of mutual information as the second TagSNP, and thereafter repeating steps (B) and (C) the remaining M-2 times to select the M-th TagSNP (M being a natural number), until M reaches the planned number of nucleic acid probes used as the means for imputation.
  26.  ヒトゲノム情報が、複数個人の遺伝子型が特定されたSNP群の情報が含まれるヒトゲノムデータベース情報であることを特徴とする、請求項25に記載のコンピュータプログラム。 26. The computer program according to claim 25, wherein the human genome information is human genome database information including information on SNP groups in which genotypes of a plurality of individuals are specified.
  27.  前記第二の機能において、(1)各々のTagSNP候補の遺伝子型の頻度、(2)各々のTagSNP候補の遺伝子座から一定範囲に定められた近傍内に存在するTargetSNP各々の遺伝子型の頻度、及び、(3)当該TagSNP候補とTargetSNP候補の遺伝子型の組み合わせの頻度、が算出されるアルゴリズムが含まれることを特徴とする、請求項25又は26に記載のコンピュータプログラム。 In the second function, (1) the frequency of the genotype of each Tag SNP candidate, (2) the frequency of the genotype of each Target SNP existing within a certain range from the locus of each Tag SNP candidate, 27. The computer program according to claim 25 or 26, further comprising: (3) an algorithm for calculating a frequency of the combination of the Tag SNP candidate and the Target SNP candidate genotype.
  28.  前記第二の機能を実現させるアルゴリズムの前段階において、相互情報量以外の指標によりTagSNP候補を選択して、前記第二の機能を行う対象となるTagSNP候補群の予備的な絞り込みを実現させるアルゴリズムが設けられていることを特徴とする、請求項25~27のいずれかに記載のコンピュータプログラム。 Algorithm for realizing preliminary narrowing down of TagSNP candidate groups to be subjected to the second function by selecting TagSNP candidates by an index other than the mutual information amount in a previous stage of the algorithm for realizing the second function. The computer program according to any one of claims 25 to 27, wherein:
  29.  相互情報量以外の指標が、前記TagSNP候補から一定範囲に定められた近傍に存在するTargetSNP群との連鎖不平衡値であることを特徴とする、請求項28に記載のコンピュータプログラム。 29. The computer program according to claim 28, wherein the index other than the mutual information amount is a linkage disequilibrium value with a TargetSNP group existing in the vicinity defined within a certain range from the TagSNP candidate.
  30.  連鎖不平衡値は、r連鎖不平衡値であることを特徴とする、請求項29に記載のコンピュータプログラム。 Linkage disequilibrium values, characterized in that it is a r 2 linkage disequilibrium values, the computer program of claim 29.
  31.  一定範囲に定められた近傍が、当該TagSNP塩基から上流及び下流へそれぞれ500kbp以内であることを特徴とする、請求項25~30のいずれかに記載のコンピュータプログラム。 The computer program according to any one of claims 25 to 30, wherein the vicinity defined in a certain range is within 500 kbp upstream and downstream from the TagSNP base.
  32.  インピュテーションをするための手段として用いる核酸プローブのために選択するTagSNPの個数は、当該手段によるインピュテーションを行った結果が所定の性能を満たす個数以上であることを特徴とする、請求項25~31のいずれかに記載のコンピュータプログラム。 The number of Tag SNPs to be selected for a nucleic acid probe used as a means for imputation is equal to or greater than a number that satisfies a predetermined performance as a result of imputation by the means. The computer program according to any one of 25 to 31.
  33.  The computer program according to claim 32, wherein the predetermined performance is that the mean of the squared correlation coefficients between the genotypes of SNPs with a MAF of 5% estimated by imputation and the actual genotypes is 0.94 or more.
  34.  The computer program according to any one of claims 25 to 33, wherein an algorithm is provided that, separately from the selection of TagSNPs, selects one or more other SNPs and preferentially designates these other SNPs as SNPs to be selected.
  35.  The computer program according to any one of claims 25 to 34, wherein the nucleic acid probe group is a nucleic acid probe group to be mounted on a DNA microarray.
  36.  A computer-readable recording medium on which the computer program according to any one of claims 25 to 35 is recorded.
  37.  The computer system for selecting TagSNPs according to any one of claims 14 to 24, wherein the system executes the computer program according to any one of claims 25 to 35.
  38.  The computer system for selecting TagSNPs according to any one of claims 14 to 24 and 37, wherein the human genome information is derived from a specific race or from a human population belonging to a smaller category.
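
As an illustration of the quantities named in claims 27 to 31, the following is a minimal sketch and not the applicants' implementation. It assumes genotypes are coded 0, 1 or 2 (alternate-allele count) for each individual, computes the three frequency tables of claim 27, and combines them into mutual information using the standard definition I(X;Y) = Σ p(x,y) log2(p(x,y) / (p(x) p(y))). The claims do not state how r² is estimated, so the squared Pearson correlation of genotype dosages is used here as an assumed stand-in. All function names are hypothetical, and positions are assumed to lie on a single chromosome.

```python
from collections import Counter
from math import log2


def genotype_frequencies(genotypes):
    """Relative frequency of each genotype code in one SNP (claim 27, items 1 and 2)."""
    n = len(genotypes)
    return {g: c / n for g, c in Counter(genotypes).items()}


def joint_frequencies(tag, target):
    """Relative frequency of each (tag, target) genotype pair (claim 27, item 3)."""
    n = len(tag)
    return {pair: c / n for pair, c in Counter(zip(tag, target)).items()}


def mutual_information(tag, target):
    """Mutual information in bits between two genotype vectors of equal length."""
    p_tag = genotype_frequencies(tag)
    p_target = genotype_frequencies(target)
    p_joint = joint_frequencies(tag, target)
    return sum(
        p * log2(p / (p_tag[a] * p_target[b]))
        for (a, b), p in p_joint.items()
    )


def dosage_r2(tag, target):
    """Squared Pearson correlation of genotype dosages (assumed stand-in for r^2)."""
    n = len(tag)
    mean_tag = sum(tag) / n
    mean_target = sum(target) / n
    sxx = sum((x - mean_tag) ** 2 for x in tag)
    syy = sum((y - mean_target) ** 2 for y in target)
    sxy = sum((x - mean_tag) * (y - mean_target) for x, y in zip(tag, target))
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)


def targets_in_window(tag_position, target_positions, window_bp=500_000):
    """Target positions within window_bp upstream or downstream of the tag (claim 31)."""
    return [p for p in target_positions if abs(p - tag_position) <= window_bp]


if __name__ == "__main__":
    tag = [0, 0, 1, 1, 2, 2]
    target = [0, 0, 1, 1, 2, 2]
    print(mutual_information(tag, target))
    print(dosage_r2(tag, target))
    print(targets_in_window(1_000_000, [400_000, 500_000, 1_500_000, 1_600_000]))
```

In a pipeline shaped like claims 28 to 31, a filter such as dosage_r2 would first narrow the candidates inside the 500 kbp window, and mutual_information would then score the remaining TagSNP candidates against their neighboring TargetSNPs.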
PCT/JP2015/067686 2014-06-20 2015-06-19 METHOD, COMPUTER SYSTEM AND SOFTWARE FOR SELECTING Tag SNP, AND DNA MICROARRAY EQUIPPED WITH NUCLEIC ACID PROBE CORRESPONDING TO Tag SNP SELECTED BY SAID SELECTION METHOD WO2015194655A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/320,438 US20170147745A1 (en) 2014-06-20 2015-06-19 Method, computer system and software for selecting tag snp, and dna microarray equipped with nucleic acid probe corresponding to tag snp selected by said selection method

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2014-126910 2014-06-20
JP2014126910 2014-06-20
JP2014-223834 2014-11-01
JP2014223834A JP6432974B2 (en) 2014-06-20 2014-11-01 TagSNP Selection Method, Selection Computer System, and Selection Software

Publications (1)

Publication Number Publication Date
WO2015194655A1 (en) 2015-12-23

Family

ID=54935631

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/067686 WO2015194655A1 (en) 2014-06-20 2015-06-19 METHOD, COMPUTER SYSTEM AND SOFTWARE FOR SELECTING Tag SNP, AND DNA MICROARRAY EQUIPPED WITH NUCLEIC ACID PROBE CORRESPONDING TO Tag SNP SELECTED BY SAID SELECTION METHOD

Country Status (2)

Country Link
TW (1) TW201617444A (en)
WO (1) WO2015194655A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11001880B2 (en) 2016-09-30 2021-05-11 The Mitre Corporation Development of SNP islands and application of SNP islands in genomic analysis

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030033290A1 (en) * 2001-05-24 2003-02-13 Garner Harold R. Program for microarray design and analysis
JP2008250971A (en) * 2007-03-02 2008-10-16 Toray Ind Inc Micro rna target gene predicting device, micro rna target gene predicting method, and program
US20090004652A1 (en) * 2006-06-09 2009-01-01 Rubin Mark A Methods for identifying and using SNP panels

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Nipponjin ni Saitekika sareta SNP Array 'Japonica Array R' o Sekkei", 24 June 2015 (2015-06-24), Retrieved from the Internet <URL:https://www.tohoku.ac.jp/japanese/newimg/pressimg/tohokuuniv-press20150624_01web.pdf> [retrieved on 20150703] *
THE INTERNATIONAL HAPMAP 3 CONSORTIUM: "Integrating common and rare genetic variation in diverse human populations", NATURE, 2 September 2010 (2010-09-02), XP055247268, Retrieved from the Internet <URL:http://www.nature.com/nature/journal/v467/n7311/full/nature09298.html> [retrieved on 20150703] *

Also Published As

Publication number Publication date
TW201617444A (en) 2016-05-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15810085

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 15320438

Country of ref document: US

122 Ep: pct application non-entry in european phase

Ref document number: 15810085

Country of ref document: EP

Kind code of ref document: A1