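Concept Data Lake A data lake is: 1. Equipped with facilities that make data easy to extract, analyze, search, and mine (it has no analytic capability of its own; rather, other analysis tools can run directly on the lake, without moving the data out first) 2. A store for all kinds of data (raw, with no unified format): structured, semi-structured, and unstructured 3. Fed from a wide variety of sources, which can be imported easily. In short, a data lake is the storage area for raw data. The concept is rarely discussed in China, yet most internet companies there already have one. In China, the whole of HDFS is usually called a data warehouse (in the broad sense), i.e. the place where all data is stored, whereas elsewhere it is usually called a data lake. Data is imported into the lake as needed; if you want to combine information from the data lake with information from a customer relationship management (CRM) system, you join the two, and that combination is performed only when it is needed. A data lake is a system or repository of multi-structured data stored in its original format and schema, usually as object "blobs" or files. The main idea is to store all of an enterprise's data in one place, from raw data (exact copies of source-system data) to the transformed target data used for tasks such as reporting, visualization, analytics, and machine learning. Data in the lake includes structured data (relational database data), semi-structured data (CSV, XML, JSON, etc.), unstructured data (emails, documents, PDFs), and binary data (images, audio, video), forming a centralized store that holds data in every form. In essence, a data lake is an approach to enterprise data architecture; physically, it is a data storage platform used to centrally store an enterprise's massive, multi-source,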
Read more »
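The "join only when needed" idea from the data lake excerpt above can be sketched as follows. This is a minimal illustration, not how any particular lake product works: the raw events, the `CRM` dictionary, and the function name `purchases_with_customer` are all hypothetical.

```python
import json

# Raw events kept exactly as they arrived (JSON lines); nothing is
# parsed or reshaped at write time. Sample data is hypothetical.
RAW_EVENTS = [
    '{"customer_id": 1, "event": "login"}',
    '{"customer_id": 2, "event": "purchase", "amount": 30}',
    '{"customer_id": 1, "event": "purchase", "amount": 12}',
]

# Hypothetical CRM records that live in a separate system.
CRM = {1: {"name": "Alice"}, 2: {"name": "Bob"}}


def purchases_with_customer(raw_events, crm):
    """Parse raw events at query time and join them with CRM by customer_id."""
    rows = []
    for line in raw_events:
        event = json.loads(line)
        if event.get("event") != "purchase":
            continue  # only purchases matter for this particular question
        customer = crm.get(event["customer_id"], {})
        rows.append({"name": customer.get("name"), "amount": event["amount"]})
    return rows


report = purchases_with_customer(RAW_EVENTS, CRM)
```

The raw lines are never rewritten; the schema is applied and the join is performed only when the report is requested.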

learning paths On-premises Env vs Cloud link The term total cost of ownership (TCO) describes the final cost of owning a given technology. In on-premises systems, TCO includes the following costs: * Hardware * Software licensing * Labor (installation, upgrades, maintenance) * Datacenter overhead (power, telecommunications, building, heating and cooling) Cloud systems like Azure track costs by subscriptions. A subscription can be based on usage that’s measured in compute units, hours, or transactions. The cost includes hardware, software, disk storage, and labor. Because of economies of
Read more »

Eli the Computer Guy Introduction The whole picture Speed & storage units Physical & logical Modem * T1 * DSL: * no faster than 12 Mb/s * Asymmetric: download is faster than upload * cable * satellite Router Firewall * blocks traffic from the internet from getting into your network VPN * lets authorized traffic from the internet get into your network * client-server Switch * connects everything together Gateway What’s a gateway? Generally speaking, any entry point into some ‘network’ is called a gateway. So for programmers there is the API gateway, the entry point to the backend-service network. In the IP network contex
Read more »

SQL Performance Explained * Read the source code of HashMap in C# * Read the source code of LinkedList in C# * "Unlike the index, the table data is stored in a heap structure and is not sorted at all"? A min-heap or max-heap is partially sorted, but a heap table is a different thing: an unordered pile of rows, not the heap data structure. Do developers need to know this? SQL separates the what from the how. However, developers need to know the how, because the access path is what influences performance most, and developers, rather than DBAs, know it. Structure (anatomy of an index) The primary purpose of an index is to provide an ordered representation of the indexed data. It is, ho
Read more »
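The contrast in the excerpt above, between unordered table data and an ordered index, can be sketched like this. It is a toy model under stated assumptions: real databases typically use B-tree indexes, and the sorted Python list here is only a stand-in; the table contents and the `lookup` helper are hypothetical.

```python
import bisect

# "Heap" table: rows sit in insertion order, with no sorting at all.
heap_table = [
    {"id": 42, "name": "carol"},
    {"id": 7, "name": "alice"},
    {"id": 19, "name": "bob"},
]

# Index: an ordered representation of the indexed column, where each
# entry points back to the row's position in the heap table.
index = sorted((row["id"], pos) for pos, row in enumerate(heap_table))
keys = [key for key, _ in index]


def lookup(key):
    """Binary-search the ordered index, then fetch the row from the table."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return heap_table[index[i][1]]
    return None  # key is not indexed


found = lookup(19)
missing = lookup(5)
```

Without the index, finding `id = 19` would mean scanning every row; with it, the search touches only a logarithmic number of index entries plus one table access.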

General Concept CPU Core Caching Cache Lines Cache Memory Associative Memory Direct-Mapped Memory Set Associative Memory Cache Read/Write Policies cache coherency MESI protocol: (Modified, Exclusive, Shared, Invalid) * Invalid lines are cache lines that are either not present in the cache, or whose contents are known to be stale. For the purposes of caching, these are ignored. Once a cache line is invalidated, it’s as if it wasn’t in the cache in the first place. * Shared lines are clean copies of the contents of main memory. Cache line
Read more »

Where is my cache for a service? Architectural Patterns for Caching Microservices Patterns: 1. Embedded: keep the cache inside the service 2. Client-server: a completely separate cache server 3. Reverse-proxy: put the cache in front of each service 4. Sidecar: run the cache as a sidecar container that belongs to the service How does a cache work? The application receives the request and checks whether the same request was already executed (and its result stored in the cache) Embedded Embedded Distributed Cache Why distributed? 1. Same requests happen on diff
Read more »
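The check described in the excerpt above (look in the cache first, run the request only on a miss) can be sketched for the embedded pattern as follows. This is a minimal sketch, assuming requests are hashable and results never expire; `EmbeddedCache` and `slow_square` are hypothetical names, not from the referenced article.

```python
from typing import Callable, Dict, Hashable


class EmbeddedCache:
    """In-process cache that lives inside the service (embedded pattern)."""

    def __init__(self, compute: Callable[[Hashable], object]) -> None:
        self._compute = compute  # hypothetical expensive request handler
        self._store: Dict[Hashable, object] = {}
        self.hits = 0
        self.misses = 0

    def handle(self, request: Hashable) -> object:
        # Was the same request already executed and stored?
        if request in self._store:
            self.hits += 1
            return self._store[request]
        # Cache miss: execute the request and remember the result.
        self.misses += 1
        result = self._compute(request)
        self._store[request] = result
        return result


def slow_square(n):
    """Stand-in for real business logic."""
    return n * n


cache = EmbeddedCache(slow_square)
first = cache.handle(4)   # miss: computed and stored
second = cache.handle(4)  # hit: served from the cache
```

Because the store lives in the service's own memory, each replica of the service would hold its own separate copy.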

Symptoms You may receive the following error message when you create a FOREIGN KEY constraint (from the Microsoft report): Server: Msg 1785, Level 16, State 1, Line 1 Introducing FOREIGN KEY constraint 'fk_two' on table 'table2' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints. Server: Msg 1750, Level 16, State 1, Line 1 Could not create constraint. See previous errors. For example, the table definition is like this: Table t1: Id: primaryKey Table t2: Id: primaryKey parent: Fore
Read more »

playing-nhibernate-inverse-and-cascade, nhibernate-inverse Bidirectional associations In a database there may be bidirectional relationships, e.g. a Parent has multiple Children, and each Child has a Parent. #### class definition class Parent: - String id - IList childs class Child: - String id - Parent parent #### db definition table Parent: - id table Child: - id - parentId Inverse Inverse focuses on the association. It defines which side is responsible for maintaining the association (create, update, delete), that is, the
Read more »

the great video History Why did we create the internet? It grew out of the American military's need to protect communication during wartime. The old communication system, the telephone, connects Lily with Tom through fixed central offices. If some of those central offices are destroyed by a nuclear attack, rerouting the line is difficult and the communication fails. Ta-da, along comes the internet. It communicates through millions of routers. Even if half of the routers are destroyed, there may still be a way to communicate. Why did we create the VPN (virtual private network)? Interne
Read more »

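nginx 502 and 504 timeout demo 502 Bad Gateway: The server was acting as a gateway or proxy and received an invalid response from the upstream server. 504 Gateway Timeout: The server was acting as a gateway or proxy and did not receive a timely response from the upstream server. Conclusion A 504 means nginx did not get a response from the upstream service in time and timed out: * The upstream service responds slowly, so reading the response / sending the request times out (upstream timed out (110: Operation timed out) while reading response header from upstream) * Some requests are simply slow to process. In that case, increase proxy_read_timeout (default 60s) * The upstream service is under too much load and responds slowly. In that case, increase the upstream service's capacity, or moderately raise proxy_send_timeout and proxy_read_timeout * Connecting to the upstream service times out. The upstream service may already be down, but because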
Read more »