Huawei OBS is a cloud object storage service.

Concepts

Object
1. the actual complete file or byte stream to be stored
2. the object name is the unique ID within a bucket
   * it is used as part of the URL path, so its naming restrictions follow URL path naming rules
3. access control (applied per object version, in fact)
   1. Object ACL: coarse-grained control over an object (read object, read/write object ACL), limited to users within the same account
   2. Object policy: fine-grained control over an object (fine-grained actions such as put, delete, …), open to all users
4. multi-versioning
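To illustrate the ACL vs. policy distinction, a hypothetical object policy might look like the sketch below. The bucket name, principal ID, and exact field names are assumptions for illustration, not taken from the OBS documentation; check the official policy schema before using it.

```json
{
  "Statement": [
    {
      "Sid": "AllowReadOnly",
      "Effect": "Allow",
      "Principal": {"ID": ["domain/other-account-id"]},
      "Action": ["GetObject"],
      "Resource": ["example-bucket/reports/*"]
    }
  ]
}
```

Unlike an ACL, which only grants coarse read/write rights to users in the same account, a policy like this can name specific actions and grant them to principals outside the account.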
Read more »

What’s beam

Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines.
* open-source (Apache v2 license)
* used to define data-parallel processing pipelines
* a unified model for defining pipelines; the actual processing is run by an underlying runner (e.g. Spark, Apache Apex, etc.). See all available runners.
* can process both batch (bounded) and streaming (unbounded) datasets

Use it

See the wordcount examples, wordcount src. Now we define a simple pipeline and run it. Transform and Count are built-in atomic operations used to define t
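The wordcount pipeline boils down to a few data-parallel steps: split each line into words, pair each word with a count, and sum per word. A plain-Python sketch of that logic (not the Beam API itself; in Beam these steps would be transforms over a PCollection) looks like this:

```python
from collections import Counter

def wordcount(lines):
    """Count word occurrences over a bounded dataset of lines."""
    counts = Counter()
    for line in lines:        # in Beam: a ParDo/Map over a PCollection
        counts.update(line.split())
    return dict(counts)

print(wordcount(["to be or not", "to be"]))
```

A runner would execute the same logical steps, but distributed across workers and, for streaming input, windowed over an unbounded source.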
Read more »

chmod, chown: understanding Linux file permissions

File permissions are defined by permission group and permission type.
1. permission group
   * owner (u)
   * group (g)
   * all other users (o)
2. permission type
   * read (r, 4)
   * write (w, 2)
   * execute (x, 1)

permission presentation

The permission in the command line is displayed as _rwxrwxrwx 1 owner:group
* the first character (the underscore _ here) is the special permission flag, which can vary
* the following three groups of rwx represent the permissions of the owner, the group, and all other users respectively. If the ow
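A numeric chmod mode is just the sum of those per-type digits for each group: owner rwx (4+2+1=7), group r-x (4+1=5), others r-- (4) gives 754. A quick sketch (stat -c is GNU coreutils; on BSD/macOS use stat -f '%Lp' instead):

```shell
touch demo.txt
chmod 754 demo.txt        # owner=rwx, group=r-x, others=r--
stat -c '%a' demo.txt     # prints: 754
ls -l demo.txt            # first column shows: -rwxr-xr--
```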
Read more »

Lombok is a library that helps you write Java more cleanly and efficiently. It plugs into the editor and the build tool and works at compile time. Essentially, it modifies the bytecode by operating on the AST (abstract syntax tree) at compile time, which javac allows. This is, in fact, a way to modify the Java grammar.

Usage

To use it,
1. install the Lombok plugin in IntelliJ
2. add the package dependency to the project (to use its annotations):

```xml
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <version>1.16.18</version>
    <scope>provided</scope>
</dependency>
```
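To see what the compile-time AST rewriting amounts to: Lombok's @Data on a small class roughly generates the getters, setters, equals/hashCode, and toString below. The snippet is written out by hand in plain Java (no Lombok on the classpath) so it is self-contained; the class and field names are made up for illustration.

```java
import java.util.Objects;

// Roughly what `@Data class Point { int x; int y; }` would generate.
class Point {
    private int x;
    private int y;

    public int getX() { return x; }
    public int getY() { return y; }
    public void setX(int x) { this.x = x; }
    public void setY(int y) { this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() { return Objects.hash(x, y); }

    @Override
    public String toString() { return "Point(x=" + x + ", y=" + y + ")"; }
}
```

With Lombok, all of this boilerplate disappears from the source file and is injected into the AST during compilation.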
Read more »

Concept

Spark is a fast, general-purpose cluster computing system, like Hadoop MapReduce. It runs on clusters.

Spark Ecosystem

The components of the Apache Spark ecosystem:
* Spark Core: the cluster computing engine. Provides the API to write computing functions.
* Spark SQL: SQL for data processing, like Hive?
* MLlib: for machine learning
* GraphX: for graph processing
* Spark Streaming: for stream processing

Core concepts
* RDDs (Resilient Distributed Datasets): the fundamental data structure in Spark. RDDs are immutable and can be split into multiple partitions that can be processed in parallel.
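To make the RDD idea concrete without a cluster, here is a plain-Python sketch (not the PySpark API) of an immutable dataset split into partitions whose elements are mapped in parallel; the function names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into partitions of roughly equal size."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_map(f, data, n_partitions=3):
    """Apply f to every element, one worker per partition,
    like rdd.map(f) on an RDD with n_partitions partitions."""
    parts = partition(data, n_partitions)
    with ThreadPoolExecutor() as pool:
        mapped = pool.map(lambda p: [f(x) for x in p], parts)
    return [y for part in mapped for y in part]

print(parallel_map(lambda x: x * x, [1, 2, 3, 4, 5]))
```

Spark does the same thing at scale: partitions live on different nodes, and because the source data is immutable, a lost partition can be recomputed from its lineage.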
Read more »

yarn architecture

YARN is used to manage/allocate cluster resources and to schedule/monitor jobs. These parts, the resource manager, were split out of the Hadoop framework. YARN has two main components:
* Scheduler: manages resources (CPU, memory, network, disk, etc.) and allocates them to applications.
  * the node manager reports the node's resource info (node status) to the Scheduler
  * the application master asks the Scheduler for resources
  * when partitioning resources among various queues and applications, the Scheduler supports pluggable policies. For example:
    * CapacityScheduler allocates resources by tenant req
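For example, CapacityScheduler queues are declared in capacity-scheduler.xml. A minimal sketch splitting the cluster between two hypothetical tenants (the queue names and percentages are made up for illustration) might look like:

```xml
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>tenant-a,tenant-b</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.tenant-a.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.tenant-b.capacity</name>
    <value>30</value>
  </property>
</configuration>
```

Each tenant is guaranteed its configured share of the cluster, and applications submitted to a queue compete only within that share.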
Read more »

Hadoop is a framework for distributed storage and computing.
* distributed storage: Hadoop uses HDFS to store large amounts of data across a cluster.
* distributed computing: Hadoop uses the MapReduce framework for fast data analysis (querying and writing) over the data in HDFS.
* resource management & job scheduling: Hadoop uses YARN to manage/allocate cluster resources (memory, CPU, etc.) and to schedule and monitor job execution.

Architecture

cluster architecture

request processing

Fault Tolerance

Use rack awareness so that replicas are saved to different racks, which handles rack failure
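A toy sketch of the rack-aware idea, assuming the default HDFS placement policy (first replica on the writer's rack, the remaining two together on one other rack), in plain Python:

```python
import random

def place_replicas(local_rack, racks, replication=3):
    """Toy rack-aware placement: 1st replica on the local rack,
    2nd and 3rd on a single different rack (HDFS default policy)."""
    other = random.choice([r for r in racks if r != local_rack])
    return [local_rack, other, other][:replication]

racks = ["rack1", "rack2", "rack3"]
print(place_replicas("rack1", racks))
```

Because at least one replica always lands on a different rack, losing an entire rack still leaves a readable copy of every block.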
Read more »

hdfs architecture

An HDFS cluster runs in a master-slave model, with two kinds of nodes:
* namenode: the master node. It knows where to find the files in HDFS.
* datanode: a slave node. It holds the file data.

namenode

See namenode and datanode. The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata of every file and directory in that tree. This information is managed in two files, the namespace image and the edit log; it is cached in RAM, and of course both files are also persisted to the local disk. The namenode records which datanodes hold each block of every file, but it does not persist that information, because it is rebuilt from the datanodes at system startup. Each file consists of multiple blocks scattered across the datanodes (with redundant replicas according to the replication factor), and the namenode knows which blocks a file has (the file's
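Since the namenode only tracks which blocks make up a file, the split itself is simple arithmetic. A sketch, assuming the common 128 MB default block size:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default

def split_into_blocks(file_size):
    """Return (offset, length) for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(BLOCK_SIZE, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# a 300 MB file -> blocks of 128 MB, 128 MB, and 44 MB
print(len(split_into_blocks(300 * 1024 * 1024)))  # prints: 3
```

Each of those blocks is then stored on several datanodes according to the replication factor, while the namenode keeps only the file-to-block mapping.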
Read more »

Apache Hive is a data warehouse application. It supports reading, writing, and managing big data in distributed storage, and supports querying with standard SQL syntax. Hive is not a database: it brings SQL capabilities by defining metadata over the files in HDFS. Long story short, it makes HDFS files queryable. Hive has no fixed data storage format; it ships with csv (comma-separated values) and tsv (tab-separated values) connectors, and connectors for other formats can also be used.

database v.s. warehouse

See the difference between database and data warehouse.
database: stores concrete business data, with full support for concurrent transactional operations (CRUD). A database contains highly detailed data as well as detailed relational views. Tables ar
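As an illustration of "metadata over HDFS files", a hypothetical HiveQL external table over comma-separated files might be declared as follows (the table name, columns, and path are made up for illustration):

```sql
CREATE EXTERNAL TABLE page_views (
  user_id STRING,
  url     STRING,
  ts      BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_views';
```

Because the table is EXTERNAL, dropping it removes only the metadata; the underlying files in HDFS stay untouched.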
Read more »