Cherish's Blog

HDP install (offline using ambari)

Posted on 2020-11-23 Edited on 2022-10-22

Reference: official installation guide
Preparation
Unless stated otherwise, run every step below on all nodes.
Modify hosts
[root@master ~]# vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.105.137 master
192.168.105.191 slave1
192.168.105.13 slave2
Modify network config
[root@master ~]# vi /etc/sysconfig/network
# Created by anaconda
NETWORKING=yes
HOSTNAME=master
[root@master ~]# hostnamectl set-hostname master
[root@master ~]# hostname master
# Ping each node to check that it is reachable
[root@maste
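Before pinging each node, it can help to confirm that every node name actually has an entry in the hosts file. The following is a minimal sketch of such a check; the helper names `parse_hosts` and `missing_nodes` and the extra node `slave3` are made up for illustration, and the sample text simply repeats the entries shown above.

```python
def parse_hosts(text):
    """Map each hostname in /etc/hosts-style text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry seen for a name
    return mapping


def missing_nodes(text, expected):
    """Return the expected node names that have no entry in the hosts text."""
    known = parse_hosts(text)
    return [name for name in expected if name not in known]


# Same entries as in the excerpt above.
HOSTS = """\
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.105.137 master
192.168.105.191 slave1
192.168.105.13 slave2
"""

print(parse_hosts(HOSTS)["slave1"])
# "slave3" is a hypothetical node that is not in the file.
print(missing_nodes(HOSTS, ["master", "slave1", "slave2", "slave3"]))
```

On a real node the text would come from reading /etc/hosts rather than from a string constant.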

atlas

Posted on 2020-11-13 Edited on 2023-05-25

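Architecture
Install
install steps
Access the Apache Atlas UI in a browser: http://localhost:21000
You can also access the REST API at http://localhost:21000/api/atlas/v2
The default username and password are (admin, admin).
Atlas Features
Define metamodels to standardize metadata
Atlas can maintain (create, read, update, delete) metadata types. It supports:
* Creating several kinds of metadata types
  * businessmetadatadef: metamodel for business metadata
  * classificationdef: metamodel for classification (tag) data
  * entitydef: metamodel for general metadata
  * enumdef
  * relationshipdef: metamodel for relationship metadata
  * structdef
* Defining attribute constraints, indexes, uniqueness, and so on in a metamodel
* Retrieval by id/typename/query
Related API
Definition: typedef request schema object
# DELETE/GET/POST/PUT /v2/types/typedef
Constraints
* type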
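To make the idea of an entitydef with attribute constraints more concrete, here is a rough sketch that only builds a request body and prints it. The field names (`entityDefs`, `attributeDefs`, `isOptional`, `isUnique`, `isIndexable`, `cardinality`) and the `DataSet` supertype reflect my understanding of the Atlas v2 typedef schema and should be checked against the typedef request schema object referenced above. The type name `sample_table` and its attributes are invented for illustration.

```python
import json


def attribute_def(name, type_name, optional=True, unique=False, indexable=False):
    """Build one attribute definition carrying the constraints the post mentions."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,    # constraint: may the value be missing?
        "cardinality": "SINGLE",   # assumed: a single value per attribute
        "isUnique": unique,        # uniqueness
        "isIndexable": indexable,  # index
    }


def entity_typedef(name, attributes, super_types=("DataSet",)):
    """Wrap attribute definitions in an entitydef request body (assumed shape)."""
    return {
        "entityDefs": [
            {
                "name": name,
                "superTypes": list(super_types),
                "attributeDefs": list(attributes),
            }
        ]
    }


# Hypothetical type: a table with a required, unique, indexed id.
payload = entity_typedef(
    "sample_table",
    [
        attribute_def("table_id", "string", optional=False, unique=True, indexable=True),
        attribute_def("owner_team", "string"),
    ],
)
print(json.dumps(payload, indent=2))
```

The sketch stops at producing JSON on purpose; sending it would use the typedef endpoint and the credentials described in the post.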

Importing data into Hive

Posted on 2020-11-11 Edited on 2022-10-22

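ftp
To import a .csv file, first get the file into HDFS, then create or update a Hive table that points at the HDFS file. There are several ways to get the file into HDFS:
1. ftp -> local -> hdfs: download the file to the local machine first, then copy it into HDFS with the hdfs command
2. ftp -> hdfs: connect to the FTP server directly and copy the file into HDFS, skipping the local copy
3. Existing data collection tools: use a real-time data stream processing system to move data between systems
I. ftp -> local -> hdfs
Several options:
1. hadoop fs -get ftp://uid:password@server_url/file_path temp_file && hadoop fs -moveFromLocal temp_file hadoop_path/dest_file
2. Following this implementation, read from FTP with a Python package, then write to HDFS with the hdfs command
from urllib.request import urlopen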
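Option 2 above (read from FTP in Python, write with the hdfs command) could be sketched roughly as follows. This is not the implementation the post links to. It assumes the `hdfs` CLI is on the PATH and relies on `hdfs dfs -put - <dest>` reading the file content from stdin. The function names are made up for illustration.

```python
import shutil
import subprocess
from urllib.request import urlopen


def hdfs_put_command(dest_path):
    """Command that writes stdin to dest_path in HDFS ("-f" overwrites)."""
    return ["hdfs", "dfs", "-put", "-f", "-", dest_path]


def stream_to_command(source, command, chunk_size=1 << 20):
    """Pipe a binary file-like object into a command's stdin; return its exit code."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    try:
        shutil.copyfileobj(source, proc.stdin, chunk_size)
    finally:
        proc.stdin.close()  # signal end of input so the command can finish
    return proc.wait()


def ftp_to_hdfs(ftp_url, dest_path):
    """Stream a file from an ftp:// URL into HDFS without a full local copy."""
    with urlopen(ftp_url) as response:
        return stream_to_command(response, hdfs_put_command(dest_path))
```

With the placeholders from option 1, a call would look like `ftp_to_hdfs("ftp://uid:password@server_url/file_path", "hadoop_path/dest_file")`. A non-zero return value means the hdfs command failed.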

sqoop

Posted on 2020-11-10 Edited on 2022-10-22

Concept
Sqoop: "sq" is the first two letters of "sql", and "oop" is the last three letters of "hadoop". It transfers bulk data between HDFS and relational database servers. It supports:
* Full Load
* Incremental Load
* Parallel Import/Export (through mapper jobs)
* Compression
* Kerberos Security Integration
* Data loading directly to Hive
Sqoop cannot import .csv files into HDFS/Hive. It only supports importing from databases and mainframe datasets.
Architecture
Sqoop provides a CLI, so you can run an import or export with a simple command. The import/export is in fact executed through map tasks. When Import f
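As an illustration of how the features listed under Concept map onto the CLI, the sketch below assembles the argument list for a `sqoop import` call. The helper name and the example connection string, table, and column are hypothetical. Password handling is left out, and the flags should be checked against the Sqoop version in use.

```python
def sqoop_import_args(connect, table, username, num_mappers=4,
                      check_column=None, last_value=None,
                      compress=False, hive_import=False):
    """Build the argv for a sqoop import (a sketch, not a complete wrapper)."""
    args = [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "--table", table,
        "--num-mappers", str(num_mappers),  # parallel import through map tasks
    ]
    if check_column is not None:
        # Incremental load: only rows whose check column exceeds last_value.
        args += ["--incremental", "append",
                 "--check-column", check_column,
                 "--last-value", str(last_value)]
    if compress:
        args.append("--compress")
    if hive_import:
        args.append("--hive-import")  # load directly into Hive
    return args


# Hypothetical example values.
full_load = sqoop_import_args("jdbc:mysql://db.example/sales", "orders", "etl")
incremental = sqoop_import_args("jdbc:mysql://db.example/sales", "orders", "etl",
                                check_column="id", last_value=1000,
                                hive_import=True)
print(" ".join(full_load))
print(" ".join(incremental))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.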

MPP (Massively Parallel Processing)

Posted on 2020-11-10 Edited on 2022-10-22

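Concept
Reference: 5分钟了解MPP数据库 ("Understand MPP databases in 5 minutes")
MPP stands for Massively Parallel Processing. Put simply, MPP distributes a task in parallel across multiple servers and nodes; once each node finishes its computation, the partial results are merged into the final result (similar to Hadoop, but aimed mainly at analytical computation over large-scale relational data).
MPP architecture characteristics
* Parallel task execution
* Distributed (localized) data storage
* Distributed computation
* Private resources
* Horizontal scaling
* Shared Nothing architecture
MPPDB vs. Hadoop
Reference: 知乎-为什么说HADOOP扩展性优于MPP架构的关系型数据库？ (Zhihu: "Why is Hadoop said to scale better than MPP-architecture relational databases?")
The biggest difference between Hadoop and MPPDB lies in their philosophy of data management.
1. HDFS/Hadoop manages data coarsely: it acts as a file system and lets users drop files straight into the pool, organized by folder hierarchy. Processing is mostly batch, which means scanning as hard as it can. To find the records that match a condition in a large pile of data, Hadoop bluntly scans every file from start to end, because it keeps no indexes or classification for these files. It manages little and knows little, so every query needs a full scan.
2. The essence of a database is data management: it offers online access, create/read/update/delete, and a range of other operations. A database manages memory fairly finely and has a well-developed system for data management and distribution. To find the records that match a condition in a large pile of data, it can, based on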
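The scatter, compute, and merge flow described under Concept can be illustrated with a toy, single-machine sketch. Threads stand in for nodes here, and the rows, the `order_id` distribution key, and the function names are all invented for illustration. A real MPP database distributes storage and compute across separate servers.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition(rows, node_count):
    """Shared-nothing storage: each row lives on exactly one node's shard."""
    shards = [[] for _ in range(node_count)]
    for row in rows:
        key = str(row["order_id"]).encode()
        shards[zlib.crc32(key) % node_count].append(row)  # hash distribution
    return shards


def local_aggregate(shard):
    """Runs on one node: sum amounts per region using only local rows."""
    totals = {}
    for row in shard:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals


def merge(partials):
    """Coordinator step: combine each node's partial result into the final one."""
    final = {}
    for partial in partials:
        for region, amount in partial.items():
            final[region] = final.get(region, 0) + amount
    return final


def sum_by_region(rows, node_count=3):
    shards = partition(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_aggregate, shards))
    return merge(partials)


ROWS = [
    {"order_id": 1, "region": "north", "amount": 10},
    {"order_id": 2, "region": "south", "amount": 5},
    {"order_id": 3, "region": "north", "amount": 7},
    {"order_id": 4, "region": "east", "amount": 3},
]
print(sum_by_region(ROWS))
```

The final totals do not depend on how many nodes are used or on which shard a row lands in, which is what allows this style of computation to scale horizontally.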

data lake

Posted on 2020-11-09 Edited on 2022-10-22

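Concept
Data lake
A data lake:
1. Is equipped with facilities that make extraction, analysis, search, and mining easy (it has no analytical capability of its own; rather, other analysis tools can run conveniently on top of the lake, without moving the data out of the lake first)
2. Stores all kinds of data (raw, in non-uniform formats): structured, semi-structured, and unstructured
3. Takes data from all kinds of sources, which can be imported into the lake easily
A data lake is the storage area for raw data. Although the concept is rarely discussed in China, most internet companies already have one. In China the whole of HDFS is usually called the data warehouse (in the broad sense), meaning the place where all data is stored, while elsewhere it is usually called a data lake. The data you need is imported into the data lake; if you want to combine information from the data lake with information in a customer relationship management (CRM) system, you join the two, and that combination is performed only when it is needed.
A data lake is a system or repository of multi-structured data, stored in its raw format and schema, typically as object "blobs" or files. The main idea of a data lake is to store all of an enterprise's data in one place, from raw data (exact copies of source system data) to target data transformed for tasks such as reporting, visualization, analytics, and machine learning. Data in a data lake includes structured data (relational database data), semi-structured data (CSV, XML, JSON, etc.), unstructured data (emails, documents, PDFs), and binary data (images, audio, video), forming a centralized data store that holds data in all forms.
In essence, a data lake is an approach to enterprise data architecture; physically, it is a data storage platform used to centrally store an enterprise's massive, multi-source,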

az data engineer certificate

Posted on 2020-10-10 Edited on 2023-05-19

learning paths
On-premises Env vs Cloud
link
The term total cost of ownership (TCO) describes the final cost of owning a given technology. In on-premises systems, TCO includes the following costs:
* Hardware
* Software licensing
* Labor (installation, upgrades, maintenance)
* Datacenter overhead (power, telecommunications, building, heating and cooling)
Cloud systems like Azure track costs by subscriptions. A subscription can be based on usage that’s measured in compute units, hours, or transactions. The cost includes hardware, software, disk storage, and labor. Because of economies of

network

Posted on 2020-07-23 Edited on 2023-12-08

Eli the computer guy
Introduction
the whole picture
Speed & storage unit
Physical & logical
modem
* t1
* dsl:
  * no faster than 12Mb/s
  * Asymmetric: download is faster than upload
* cable
* satellite
Router
firewall
* blocks the internet from getting into your network
VPN
* lets traffic from the internet into your network
* client-server
Switch
* Connects everything together
Gateway
what’s gateway
Generally speaking, any entry to some ‘network’ is called a gateway. So for programmers, there is the API gateway, the entry to the backend-service network. In the IP network contex

Cache Memory

Posted on 2020-06-28 Edited on 2023-05-10

General Concept CPU Core Caching Cache Lines Cache Memory Associative Memory Direct-Mapped Memory Set Associative Memory Cache Read/Write Policies cache coherency
MESI protocol: (Modified, Exclusive, Shared, Invalid)
* Invalid lines are cache lines that are either not present in the cache, or whose contents are known to be stale. For the purposes of caching, these are ignored. Once a cache line is invalidated, it’s as if it wasn’t in the cache in the first place.
* Shared lines are clean copies of the contents of main memory. Cache line
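The four MESI states can be read as a small state machine for a single cache line in a single cache. The sketch below follows the textbook transitions rather than any particular CPU. The event names and the `others_have_copy` flag are made up for illustration, and write-backs and bus messages are not modeled.

```python
STATES = {"M", "E", "S", "I"}


def next_state(state, event, others_have_copy=False):
    """Return the new MESI state of one cache line after an event.

    local_read / local_write: this core accesses the line.
    remote_read / remote_write: another core accesses the same line.
    """
    if state not in STATES:
        raise ValueError(f"unknown state: {state}")
    if event == "local_read":
        if state == "I":
            # A miss: Shared if another cache holds the line, else Exclusive.
            return "S" if others_have_copy else "E"
        return state  # M, E and S can all serve a read hit
    if event == "local_write":
        return "M"  # writing requires sole ownership; other copies are invalidated
    if event == "remote_read":
        # A valid line ends up Shared (a Modified line is written back first).
        return "I" if state == "I" else "S"
    if event == "remote_write":
        return "I"  # another core took ownership, so this copy is stale
    raise ValueError(f"unknown event: {event}")


# Walk one line through a typical sequence of events.
state = "I"
for event in ["local_read", "local_write", "remote_read", "remote_write"]:
    state = next_state(state, event)
    print(event, "->", state)
```

The last step ends in Invalid, which matches the description above: after another core writes the line, this cache treats it as if it had never been loaded.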

Cache - MicroService

Posted on 2020-06-03 Edited on 2023-09-25

Where is my cache for a service
Architectural Patterns for Caching Microservices
Patterns:
1. embedded: save cache in the service
2. client-server: a completely separate cache server
3. reverse-proxy: put the cache in front of each service
4. Sidecar: put the cache as a sidecar container that belongs to the service
How does cache work? The application receives the request and checks if the same request was already executed (and stored in the cache)
Embedded
Embedded Distributed Cache
Why distributed?
1. Same requests happen on diff
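The embedded pattern and the "check whether the same request was already executed" step can be sketched with an in-process dictionary. The class name, the request key format, and the handler are hypothetical. Eviction and expiry are left out, although a real service would need them.

```python
class EmbeddedCache:
    """In-process cache: lives inside the service, keyed by request."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, request_key, handler):
        # Was the same request already executed and stored?
        if request_key in self._store:
            return self._store[request_key]
        result = handler(request_key)      # cache miss: do the real work
        self._store[request_key] = result  # remember it for next time
        return result


calls = []


def slow_handler(request_key):
    """Stand-in for the expensive business logic behind a request."""
    calls.append(request_key)
    return request_key.upper()


cache = EmbeddedCache()
first = cache.get_or_compute("get /users/1", slow_handler)
second = cache.get_or_compute("get /users/1", slow_handler)
print(first == second, len(calls))
```

Because the dictionary lives inside one service instance, a second instance of the same service would start with an empty cache and recompute the same requests.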