The lps data structure is designed for efficient parsing and representation of strings with advanced features such as reverse complement transformations, offset-based indexing, and binary file serialization/deserialization.
This document provides detailed instructions on initializing, using, and managing the lps data structure.
The lps structure is composed of two main components:
lpsstructure: Represents the main object with attributes likelevel,size, and an array ofcoreobjects.corestructure: Represents the individual elements stored in thelps, with attributes for bit representation, labels, and positional information.
struct lps {
int level;
int size;
struct core *cores;
};
struct core {
ubit_size bit_size; // Size of the bit representation
ublock *bit_rep; // Pointer to the bit representation
ulabel label; // Unique label for the core
uint64_t start; // Start index in the string
uint64_t end; // End index in the string
};Initializes an lps object using a given string.
Parameters:
struct lps *lps_ptr: Pointer to thelpsobject to initialize.const char *str: Input string to be parsed.int len: Length of the input string.
Usage:
struct lps my_lps;
init_lps(&my_lps, "ACGTACGT", 8);Initializes an lps object with offset-based indexing. This function adds the offset value to each cores' indices.
Parameters:
struct lps *lps_ptr: Pointer to thelpsobject to initialize.const char *str: Input string to be parsed.int len: Length of the input string.uint64_t offset: Offset value for indexing.
Usage:
struct lps my_lps;
init_lps_offset(&my_lps, "ACGTACGT", 8, 100);Initializes an lps object and includes a reverse complement transformation.
Parameters:
struct lps *lps_ptr: Pointer to thelpsobject to initialize.const char *str: Input string to be parsed.int len: Length of the input string.
Usage:
struct lps my_lps;
init_lps2(&my_lps, "ACGTACGT", 8);Initializes an lps object by reading from a binary file.
Parameters:
struct lps *lps_ptr: Pointer to thelpsobject to initialize.FILE *in: File pointer to the binary file containing serializedlpsdata.
Usage:
struct lps my_lps;
FILE *input_file = fopen("lps_data.bin", "rb");
init_lps3(&my_lps, input_file);
fclose(input_file);Initializes an lps object using divide and conquer approach.
Parameters:
struct lps *lps_ptr: Pointer to thelpsobject to initialize.const char *str: Input string to be parsed.int len: Length of the input string.int chunk_size: Size of the chunks to be processed.
Usage:
struct lps my_lps;
init_lps4(&my_lps, "ACGTACGT", 8, 100000);Frees dynamically allocated memory associated with an lps object.
Parameters:
struct lps *lps_ptr: Pointer to thelpsobject to free.
Usage:
free_lps(&my_lps);Serializes and writes an lps object to a binary file.
Parameters:
struct lps *lps_ptr: Pointer to thelpsobject to write.FILE *out: File pointer to the binary file for writing.
Usage:
FILE *output_file = fopen("lps_data.bin", "wb");
write_lps(&my_lps, output_file);
fclose(output_file);This section provides functions to manage the encoding of standard DNA bases (A, C, G, T) and their complements, used in the Locally Consistent Parsing (LCP) tool. Please note that any custom alphabet encoding can be provided to the program.
- Description: Displays the summary of the alphabet encoding, including coefficients and dictionary bit size. This function helps you understand the encoding setup and verify that the parameters are correctly configured.
- Usage: Simply call
LCP_SUMMARY()to print the encoding details to the console.
void LCP_SUMMARY();- Description: Initializes the encoding coefficients for the standard DNA bases (A, C, G, T) and their complements. The function sets default values for the coefficients and dictionary bit size.
- Usage: Call
LCP_INIT()to initialize the encoding with the default values.
void LCP_INIT();- Description: Initializes the encoding coefficients for standard DNA bases (A, C, G, T) and their reverse complements. Sets default values for coefficients and dictionary bit size. The verbosity of the output can be controlled by the
verboseparameter:- If
verboseis 0, no encoding summary will be printed. - If
verboseis 1, the encoding summary is printed after initialization.
- If
- Usage: Call
LCP_INIT2(0)orLCP_INIT2(1)based on whether you want the encoding summary printed.
void LCP_INIT2(int verbose);- Description: Initializes the encoding coefficients by reading them from a file. The file must contain three columns: the character (DNA base), the encoding value, and the complement encoding value for each base. After initializing the encoding coefficients, the function prints the encoding summary if
verboseis set to 1. - Parameters:
filename: The path to the file containing the character encodings. The file format should have the following columns: character (A, C, G, T), encoding value, and complement encoding value.verbose: If true (1), the encoding summary will be printed after initialization. If false (0), no summary will be printed.
- Returns: Always returns 0 upon successful initialization.
- Throws:
std::invalid_argumentif any invalid data is found in the file (e.g., missing or incorrect entries). - Usage: Call
LCP_INIT_FILE("path/to/encoding_file.txt", 1)to initialize the encoding from a file and print the summary.
int LCP_INIT_FILE(const char *filename, int verbose);The file for initializing encoding should be in the following format:
A 0 1
C 1 0
G 2 3
T 3 2
Where:
- The first column represents the DNA base (A, C, G, T).
- The second column represents the encoding value.
- The third column represents the complement encoding value.
#include <stdio.h>
#include "lps.h"
int main() {
LCP_INIT();
struct lps my_lps;
// Initialize from a string
init_lps(&my_lps, "ACGTACGT", 8);
// Serialize to a file
FILE *output_file = fopen("lps_data.bin", "wb");
write_lps(&my_lps, output_file);
fclose(output_file);
// Free memory
free_lps(&my_lps);
// Deserialize from a file
FILE *input_file = fopen("lps_data.bin", "rb");
init_lps3(&my_lps, input_file);
fclose(input_file);
// Free memory again
free_lps(&my_lps);
return 0;
}- The encoding coefficients define how each DNA base and its complement are represented internally. These values are essential for performing efficient parsing and comparison of DNA sequences.
- Ensure that the encoding file is properly formatted to avoid errors during initialization.
- Always ensure memory is freed after use to avoid leaks.
- Use valid and open binary files for serialization/deserialization functions.
- Offset and complement functions provide advanced features for genomic data analysis.