Isolate large text and binary data
Only after saving and reloading e. This will only be an issue for commands which rely on relative SNP positions e. If the LOG file does not show a message that the order of SNPs has changed after using --update-map , one need not worry. The name and chromosome code of a SNP can also be changed, by adding the modifiers --update-name or --update-chr , e.
You cannot update more than one attribute at a time for SNPs.

Update allele information

To recode alleles, for example from A,B allele coding to A,C,G,T coding, use the command --update-alleles , for example.

Force a specific reference allele

It is possible to manually specify which allele is the A1 allele and which is A2. By default, the minor allele is assigned to be A1. All odds ratios, etc, are calculated with respect to the A1 allele i. To set a particular allele as A1 , which might not be the minor allele, use the command --reference-allele , which can be used with any other analysis or data generation command, e.
This command can make comparing results across studies easier: for example, reported odds ratios can be made to point in the same direction as those of another study.

Update individual information

Rather than manually editing PED or FAM files (which is not advised), use these functions to change ID codes, sex and parental information for individuals in a fileset.
The command plink --bfile mydata --update-ids recoded. Not all people need be listed in the file (those not listed will not be changed); the order of the file need not match the original dataset.
Two similar commands (which cannot be run at the same time as --update-ids) are --update-sex myfile1. With all of these commands, you need to issue a data output command (--make-bed, --recode, etc.) for the changes to be preserved.

Write covariate files

If a covariate file is specified along with any of the above --recode options or with --make-bed, then that covariate file will also be written, as plink.
This option is useful if the covariate file has a different number of individuals, or is ordered differently, to produce a set of covariate values that line up more easily with the newly-created genotype and phenotype files.
If you want just to create a revised version of the covariate file, but without creating a new set of genotype files, then use the --write-covar option. To also include phenotype information in the plink. This can be useful, for example, when used in conjunction with --recodeA to generate the files needed to replicate an analysis in R e. To recode a categorical variable to a set of binary dummy variables, add the command --dummy-coding for example.
A 1 5 0. Note that one level is automatically excluded 1 in this case, i. The command can operate on multiple covariates in a single file at the same time.
Note that missing values are correctly handled i. Write cluster files Similar to --write-covar , the --write-cluster will output the single selected cluster from the file specified by --within.
Unlike covariate files, this allows string labels to be used. The --dummy-coding option cannot currently be used with --write-cluster, however.
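The dummy-coding idea described above (one categorical covariate expanded into binary indicator columns, with one level dropped and missing values carried through) can be sketched in Python. This is only an illustration, not PLINK's implementation: the function name dummy_code, the COV_ column prefix, the choice of the first sorted level as the excluded one, and the use of -9 as the missing code are all assumptions made for the example.

```python
def dummy_code(values, missing="-9"):
    """Expand one categorical covariate into K-1 binary indicator columns.

    Assumptions for this sketch: values are strings, the first level in
    sorted order is the excluded reference level, and `missing` marks
    a missing value.
    """
    levels = sorted({v for v in values if v != missing})
    kept = levels[1:]  # drop the reference level
    header = ["COV_" + level for level in kept]
    rows = []
    for v in values:
        if v == missing:
            # A missing covariate stays missing in every dummy column.
            rows.append([missing] * len(kept))
        else:
            rows.append(["1" if v == level else "0" for level in kept])
    return header, rows
```

With three levels, each individual ends up with two indicator columns, and an individual at the excluded level has 0 in both.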
To flip strand for just a subset of the sample e. HINT When merging two datasets, it is clearly very important that the two sets of SNPs are concordant in terms of positive or negative strand. Whereas some mismatches will be easy to spot as more than two alleles will be observed in the merged dataset, other instances will not be so easy to spot, i.
Using LD to identify incorrect strand assignment in a subset of the sample If cases and controls have been genotyped separately and then the data merged, it is always possible that strand has been incorrectly or incompletely assigned to each SNP, meaning that the merged data may contain a number of SNPs for which the allele coding differs between cases and controls or between any other grouping, such as collection site, etc.
If the two mis-matched groups correspond to cases and controls exactly, then rare SNPs will show a very strong association with disease e. More common SNPs could show intermediate levels of association that might be easier to confuse with a real signal. A simple approach to detect some proportion of such SNPs uses differential patterns of LD in cases versus controls: For these SNP pairs, it counts the number of times the signed correlation is different in sign between cases and controls a negative LD pair versus the same a positive LD pair.
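The counting step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not PLINK's code: genotypes are taken to be allele counts (0, 1, 2) per individual, the helper names pearson and flip_scan are hypothetical, and the rule that a pair is counted only when the absolute correlation reaches min_r in both groups (with min_r=0.5) is an assumption chosen for the example.

```python
from math import sqrt


def pearson(x, y):
    """Signed Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy / sqrt(sxx * syy)


def flip_scan(cases, controls, index, min_r=0.5):
    """Count (POS, NEG) pairs for one index SNP.

    cases / controls map SNP name -> list of allele counts.
    POS: the correlation has the same sign in cases and controls.
    NEG: the correlation has opposite signs in the two groups.
    """
    pos = neg = 0
    for other in cases:
        if other == index:
            continue
        r_case = pearson(cases[index], cases[other])
        r_ctrl = pearson(controls[index], controls[other])
        # Assumption: skip pairs that are not in clear LD in both groups.
        if abs(r_case) < min_r or abs(r_ctrl) < min_r:
            continue
        if (r_case > 0) == (r_ctrl > 0):
            pos += 1
        else:
            neg += 1
    return pos, neg
```

An index SNP whose NEG count is above zero while its neighbours mostly score POS is the kind of candidate the text describes.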
For example, the command plink --bfile mydata --flip-scan produces the output file plink. In contrast, there is not a single SNP for which both cases and controls have a consistent pattern of LD. So, in this particular case, it would suggest that strand is flipped in either cases or controls. To display the specific sets of correlations in cases and controls for each SNP, add the option --flip-scan-verbose, which generates a file plink.
This latter class of SNP would not cause problems of spurious association in single SNP analysis, but it could cause severe problems in haplotype and imputation analysis. Also, if more than one SNP in a region shows strand flips, or if there is a higher level of mis-coding alleles in general, then this approach may indicate that there are problems many NEG scores above 0 but it might be less clear how to remedy them. To know which to resolve cases or controls one would need to look at the frequency in other panels, or even the correlations, e.
Ideally, one would only need to do this for a small number of SNPs, if any. The --flip and --flip-subset commands described above can then be used to flip the appropriate genotypes. Finally, the default threshold for counting can be changed by the following command: The number of flanking SNPs that are considered for each index SNP can be modified with the command --ld-window 10, which sets the number of SNPs considered upstream and downstream; the maximum physical distance away from the index SNP (1Mb by default) is specified in kb with the command: A --recode or --make-bed, etc. option is necessary to output the newly merged file; in this case, the --out option will create the files merge-recode.
The --merge option can also be used with binary PED files, either as input or output, but not as the second file: If the second fileset data2. The two filesets can either overlap completely, partially, or not at all both in terms of markers and individuals. Imputed genotypes will be set to missing i. By default, any existing genotype data i. By specifying a --merge-mode this default behavior can be changed.
NOTE Alleles must be exactly coded to match: You can use the --allele and --alleleACGT commands prior to merging to convert datasets and then merge these consistently coded files you cannot convert and merge on the fly, i.
Consider, for an extreme example, the case where each fileset contains only a single SNP, and that there are thousands of these files -- this option would help build a single fileset, in this case. Then using the command plink --file fA --merge-list allfiles. In this case, the file allfiles.
The --merge-mode option can also be used with the --merge-list option, as described above: Extract a subset of SNPs: Based on a single chromosome --chr To analyse only a specific chromosome use plink --file data --chr 6 Based on a range of SNPs --from and --to To select a specific range of markers that must all fall on the same chromosome use, for example: The --snps command will accept a comma-delimited list of SNPs, including ranges based on physical position. For example, plink --bfile mydata --snps rsrs,rsrs,rs,rs selects the same range as above rs to rs but also the separate range rs to rs as well as the two individual SNPs rs and rs Note that SNPs need not be on the same chromosome; also, a range can span multiple chromosomes the range is defined based on chromosome code order in that case, as well as physical position, i.
No spaces are allowed between SNP names or ranges, i. Based on physical position --from-kb , etc One can also select regions based on a window defined in terms of physical distance rather than SNP ID, using the command: HINT Two alternate forms of the --from-kb command are --from-bp and --from-mb that take a parameter in terms of base-pair position or megabase position, instead of kilobase to be used with the corresponding --to-bp and --to-mb options.
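How a comma-delimited --snps style list might be expanded can be sketched in Python. This is a hedged illustration only: the function select_snps is hypothetical, the SNP names used with it are placeholders, and the hyphen as the range separator is an assumption of this sketch. It takes the marker list already sorted by chromosome code and then physical position, so a range may span chromosomes as the text notes.

```python
def select_snps(spec, ordered_ids):
    """Expand a comma-delimited list of SNP names and ranges.

    spec        -- e.g. "snp2-snp4,snp7" (hyphen range syntax is assumed)
    ordered_ids -- all SNP names, sorted by chromosome code, then position
    """
    if " " in spec:
        # The text states that no spaces are allowed in the list.
        raise ValueError("no spaces are allowed between SNP names or ranges")
    position = {snp: i for i, snp in enumerate(ordered_ids)}
    chosen = set()
    for item in spec.split(","):
        if "-" in item:
            start, end = item.split("-", 1)
            # Assumption: a range is inclusive and its ends may be given
            # in either order.
            lo, hi = sorted((position[start], position[end]))
            chosen.update(range(lo, hi + 1))
        else:
            chosen.add(position[item])
    return [ordered_ids[i] for i in sorted(chosen)]
```

Returning the selection in map order rather than in the order typed keeps the result independent of how the list was written.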
One must combine this option with the desired analytic e. The format of myrange. For example, if the SET file genes. One must combine these options with the desired analytic e. As described above, the --range command can modify the behaviour of --exclude in the same manner as for --extract. Make missing a specific set of genotypes To blank out a specific set of genotypes, use the following commands, e. HINT See the section on handling obligatory missing genotype data, which can often be useful in this context.
Extract a subset of individuals To keep only certain individuals in a file, use the option: Remove a subset of individuals To remove certain individuals from a file plink --file data --remove mylist.
Filter out a subset of individuals Whereas the options to keep or remove individuals are based on files containing lists, it is also possible to specify a filter to include only certain individuals based on phenotype, sex or some other variable. The basic form of the command is --filter which takes two arguments, a filename and a value to filter on, for example: The filter can be any integer numeric value.
As with --pheno and --within , you can specify an offset to read the filter from a column other than the first after the obligatory ID columns.
Use the --mfilter option for this. For example, if you have a binary fileset, and so the FAM file contains phenotype as the sixth column, then you could specify plink --bfile data --filter data. Because filtering on cases or controls, or on sex, or on position within the family, will be common operations, there are some shortcut options that can be used instead of --filter. These are --filter-cases --filter-controls --filter-males --filter-females --filter-founders --filter-nonfounders These flags can be used in any circumstances, e.
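The filter logic described above (a file with two obligatory ID columns, a value to match, and an optional column offset) can be sketched in Python. The function filter_individuals is a hypothetical illustration, not PLINK code; it assumes whitespace-delimited lines and compares the value as text.

```python
def filter_individuals(lines, value, mfilter=1):
    """Return (family ID, individual ID) pairs whose filter column matches.

    lines   -- rows of the filter file: family ID, individual ID, then values
    value   -- value to keep, compared as a string in this sketch
    mfilter -- which column after the two ID columns to read (1 = first)
    """
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 + mfilter:
            continue  # row is too short to contain the requested column
        if fields[1 + mfilter] == value:
            kept.append((fields[0], fields[1]))
    return kept
```

For a FAM file, where the phenotype is the sixth column, the phenotype is the fourth column after the two IDs, so the sketch would be called with mfilter=4.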
Attribute filters for markers and individuals

One can define an attribute file for SNPs (or for individuals, see below) that is simply a list of user-defined attributes for SNPs. For example, this might be a file snps. Not all SNPs need appear in this file; SNPs not in the dataset are allowed to appear (they are just ignored); the order does not need to be the same.
Each SNP should only be listed once, however.

Long data can be retrieved in parts with SQLGetData. If it is called more than once in succession for the same column, each call returns a successive part of the data.
The application is responsible for putting the long data together, which might mean concatenating the parts of the data. Each part is null-terminated; the application must remove the null-termination character if concatenating the parts. Retrieving data in parts can be done for variable-length bookmarks as well as for other long data.

Columns retrieved this way must be accessed in order of increasing column number, because of the way the columns of a result set are read from the data source. They must also have a higher column number than the last bound column. For this reason, applications should make sure to place long data columns at the end of the select list.
For more information, see Using Block Cursors. Some drivers do not enforce these restrictions.

The SQL_ATTR_MAX_LENGTH statement attribute restricts the number of bytes of data that will be returned for any character or binary column. For example, suppose a column contains long text documents.
An application that browses the table containing this column might have to display only the first page of each document. Although this statement attribute can be simulated in the driver, there is no reason to do this.
In particular, if an application wants to truncate character or binary data, it should bind a small buffer to the column with SQLBindCol and let the driver truncate the data.
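The part-by-part retrieval described earlier, where each part is null-terminated and the terminator must be dropped before concatenating, can be illustrated with a small Python simulation. Nothing here calls a real driver: read_in_parts is a hypothetical stand-in for repeated retrieval calls into a fixed-size buffer, and concat_parts shows the reassembly the application is responsible for. The sketch assumes character data.

```python
def read_in_parts(data, buffer_size):
    """Simulate successive retrievals of long character data.

    Each yielded part fits in `buffer_size` bytes, including the
    trailing null terminator. Hypothetical helper, not a driver API.
    """
    if buffer_size < 2:
        raise ValueError("buffer must hold at least one byte plus the terminator")
    step = buffer_size - 1  # one byte is reserved for the terminator
    for start in range(0, len(data), step):
        yield data[start:start + step] + b"\x00"


def concat_parts(parts):
    """Reassemble the parts, removing each part's null terminator."""
    out = bytearray()
    for part in parts:
        if part.endswith(b"\x00"):
            part = part[:-1]
        out += part
    return bytes(out)
```

Skipping the terminator removal would leave a stray null byte embedded after every part of the reassembled value.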