spider-rs/auto-encoder

auto_encoder

auto_encoder is a Rust library that detects text encodings, identifies known binary file formats, and converts bytes to strings using the encoding associated with a language or locale.

Features

  • Automatic Encoding Detection: Detects text encoding based on locale or content.
  • Binary Format Detection: Checks if a given file is a known binary format by inspecting its initial bytes.
  • HTML Language Detection: Extracts and detects the language of an HTML document from its content.

Installation

Add this to your Cargo.toml:

[dependencies]
auto_encoder = "0.1"

Usage

Encoding Detection

Look up the encoding associated with a given locale:

use auto_encoder::encoding_for_locale;

let encoding = encoding_for_locale("ja-jp").unwrap();
println!("Encoding for Japanese locale: {:?}", encoding);
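
One way to picture a locale lookup is a small table with a fallback from the full tag (`ja-jp`) to the base language (`ja`). The sketch below uses only the standard library and returns plain label strings rather than `encoding_rs` values. The function name `label_for_locale`, the fallback rule, and the table entries are assumptions made for illustration (the table covers only the three encodings this README names); this is not the crate's implementation.

```rust
/// Hypothetical lookup: maps a locale tag to an encoding label.
/// The table is illustrative and far smaller than a real one.
fn label_for_locale(locale: &str) -> Option<&'static str> {
    // Normalize case and separator so "ja_JP", "ja-JP" and "ja-jp" all match.
    let tag = locale.trim().to_ascii_lowercase().replace('_', "-");
    // Try the full tag first, then fall back to the base language.
    let base = tag.split('-').next().unwrap_or("");
    [tag.as_str(), base].iter().find_map(|key| match *key {
        "ja" => Some("SHIFT_JIS"),
        "zh-cn" => Some("GB18030"),
        "fr" | "de" | "es" => Some("WINDOWS_1252"),
        _ => None,
    })
}
```

With this table, `label_for_locale("ja-jp")` finds no entry for the full tag, falls back to `ja`, and yields `Some("SHIFT_JIS")`; an unknown tag yields `None`, which is why the real `encoding_for_locale` result is worth matching on instead of unwrapping in production code.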

Convert HTML bytes to a string using the encoding associated with a language:

use auto_encoder::encode_bytes_from_language;

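// Byte string literals (b"...") only accept ASCII, so take the bytes of a regular string.
let html_content = "こんにちは、世界!".as_bytes();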
let encoded = encode_bytes_from_language(html_content, "ja");
println!("Encoded content: {}", encoded);

Binary Format Detection

Check if a given file content is a known binary format:

use auto_encoder::is_binary_file;

let file_content = &[0xFF, 0xD8, 0xFF]; // JPEG file signature
let is_binary = is_binary_file(file_content);
println!("Is the file a known binary format? {}", is_binary);
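
Checking the initial bytes of a file usually means comparing them against a table of known signatures ("magic numbers"). The following is a minimal standard-library sketch of that technique. The names `SIGNATURES`, `sniff_format`, and `looks_binary` are hypothetical, and the signature list is a small illustrative sample, not the set of formats auto_encoder recognizes.

```rust
/// Hypothetical signature table: (format name, leading bytes).
/// Illustrative only; not the list used by auto_encoder.
const SIGNATURES: &[(&str, &[u8])] = &[
    ("jpeg", &[0xFF, 0xD8, 0xFF]),
    ("png", &[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]),
    ("gif", b"GIF8"),
    ("pdf", b"%PDF-"),
    ("zip", &[b'P', b'K', 0x03, 0x04]),
];

/// Returns the name of the first signature that prefixes `content`, if any.
fn sniff_format(content: &[u8]) -> Option<&'static str> {
    SIGNATURES
        .iter()
        .find(|(_, magic)| content.starts_with(magic))
        .map(|(name, _)| *name)
}

/// Boolean form of the check, mirroring the shape of `is_binary_file`.
fn looks_binary(content: &[u8]) -> bool {
    sniff_format(content).is_some()
}
```

Only a prefix comparison is involved, so the check stays cheap even for large files: a caller needs to read just the first few bytes before deciding whether to attempt text decoding.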

HTML Language Detection

Detect the language attribute from an HTML document:

use auto_encoder::detect_language;

let html_content = br#"<html lang="en"><head><title>Test</title></head><body></body></html>"#;
let language = detect_language(html_content).unwrap();
println!("Language detected: {}", language);
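
For the attribute case shown above, the extraction step can be sketched with the standard library alone: locate the opening `<html` tag, find a `lang=` attribute inside it, and read the value. The helper name `html_lang` is hypothetical, and the sketch handles only space-separated attributes on the `<html>` tag; it makes no claim about how `detect_language` works internally or about any content-based detection it may perform.

```rust
/// Hypothetical helper: pulls the `lang` value out of the opening <html> tag.
fn html_lang(html: &[u8]) -> Option<String> {
    // Lossy conversion keeps the sketch simple for non-UTF-8 input.
    let text = String::from_utf8_lossy(html);
    // ASCII lowercasing preserves byte offsets, so indices are valid in both strings.
    let lower = text.to_ascii_lowercase();
    let start = lower.find("<html")?;
    let end = start + lower[start..].find('>')?;
    let tag = &text[start..end];
    let tag_lower = &lower[start..end];
    // The leading space avoids matching attributes such as `xml:lang`.
    let attr = tag_lower.find(" lang=")? + " lang=".len();
    let value: String = tag[attr..]
        .trim_start_matches(|c: char| c == '"' || c == '\'')
        .chars()
        .take_while(|c| !matches!(c, '"' | '\'' | ' ' | '>'))
        .collect();
    if value.is_empty() {
        None
    } else {
        Some(value)
    }
}
```

Returning `Option<String>` matches the shape of `detect_language`: a document without a `lang` attribute produces `None`, so callers should handle that case instead of calling `unwrap()` on untrusted input.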

API Documentation

Functions

encoding_for_locale

Returns the encoding for a given locale, or None if the locale is not recognized.

pub fn encoding_for_locale(locale: &str) -> Option<&'static encoding_rs::Encoding>;

is_binary_file

Check if the file is a known binary format using its initial bytes.

pub fn is_binary_file(content: &[u8]) -> bool;

detect_language

Detect the language of an HTML resource based on its content.

pub fn detect_language(html_content: &[u8]) -> Option<String>;

encode_bytes

Returns the content as a String, interpreted with the encoding named by label. Pass a valid encoding label such as SHIFT_JIS.

pub fn encode_bytes(html: &[u8], label: &str) -> String;
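
The label matters because the same bytes produce different text under different encodings. The sketch below illustrates this with the standard library only. It swaps in ISO-8859-1, where every byte maps to the code point of the same value, because that decoder fits in one line; it does not use `encoding_rs` or reproduce `encode_bytes`, and the function name `decode_latin1` is hypothetical.

```rust
/// Decodes bytes as ISO-8859-1: each byte becomes the code point of the same value.
/// A stand-in for a real single-byte decoder, used here only for illustration.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Decodes the same bytes as UTF-8, replacing invalid sequences with U+FFFD.
fn decode_utf8_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

Given the bytes `[0x63, 0x61, 0x66, 0xE9]`, `decode_latin1` returns `café`, while `decode_utf8_lossy` returns `caf` followed by a replacement character, because `0xE9` on its own is not valid UTF-8.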

encode_bytes_from_language

Returns the content as a String, interpreted with the encoding associated with a language code (e.g., ja for Japanese).

pub fn encode_bytes_from_language(html: &[u8], language: &str) -> String;

Supported Locales and Encodings

The library maps a wide range of locales to their corresponding encodings, including WINDOWS_1252 for Western European languages, SHIFT_JIS for Japanese, and GB18030 for Simplified Chinese.

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for details.
