Hadoop is a framework for distributed storage and computing.
* Distributed storage: Hadoop uses HDFS to store large amounts of data across a cluster.
* Distributed computing: Hadoop uses the MapReduce framework to run fast data analysis (queries and writes) over the data in HDFS.
* Resource manager and job scheduler: Hadoop uses YARN to manage and allocate cluster resources (memory, CPU, etc.) and to schedule and monitor job execution.
Architecture
cluster architecture
request processing
Fault Tolerance
Use rack awareness so that replicas are stored on different racks, which protects the data against the failure of a whole rack.
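To make the rack-awareness idea concrete, here is a minimal toy sketch in Python. It is not HDFS code: the node names, rack names, and both helper functions are hypothetical, and the placement rule (first replica on the writer, the rest on another rack first) is only loosely modeled on what a rack-aware cluster does.

```python
# Toy model only: node names, rack names and helpers are made up for illustration.
RACK_OF = {
    "node1": "rackA", "node2": "rackA",
    "node3": "rackB", "node4": "rackB",
}

def place_replicas(writer, rack_of, replication=3):
    """Pick nodes for one block so that the replicas span at least two racks."""
    local_rack = rack_of[writer]
    remote = sorted(n for n, r in rack_of.items() if r != local_rack)
    local = sorted(n for n, r in rack_of.items() if r == local_rack and n != writer)
    # First replica on the writer, then nodes on other racks, then the local rack.
    candidates = [writer] + remote + local
    return candidates[:replication]

def survives_rack_failure(replicas, rack_of, failed_rack):
    """True if at least one replica lives outside the failed rack."""
    return any(rack_of[n] != failed_rack for n in replicas)

replicas = place_replicas("node1", RACK_OF)
print(replicas)
print(survives_rack_failure(replicas, RACK_OF, "rackA"))
```

Because the replicas never all share one rack (as long as a second rack exists), losing either rack still leaves a readable copy.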
Read more »

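HDFS architecture
An HDFS cluster runs in a master-slave model with two kinds of nodes:
* namenode: the master node. It knows where to find the files in HDFS.
* datanode: the slave node. It holds the data of the files.
namenode
See namenode and datanode.
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata of all files and directories in that tree. Two files hold this information: the namespace image and the edit log. The information is cached in RAM, and both files are also persisted on the local disk. The namenode also records which datanodes hold each block of every file, but it does not persist this information, because it is rebuilt from the datanodes when the system starts.
Each file consists of multiple blocks, which are spread across the datanodes (with redundant copies according to the replication factor), while the namenode knows which blocks make up a file (the file's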
Read more »

Apache Hive is a data warehouse application. It supports reading, writing, and managing large datasets in distributed storage, and querying them with standard SQL syntax. Hive is not a database. It makes SQL capabilities available by defining metadata for the files in HDFS. Long story short, it makes it possible to query HDFS files. Hive has no fixed data storage format. It ships with connectors for CSV (comma-separated values) and TSV (tab-separated values), and connectors for other formats can also be used.
database vs. warehouse
See the difference between database and data warehouse.
database: stores concrete business data and fully supports concurrent transactional operations (CRUD). A database contains highly detailed data as well as detailed relational views. Tables ar
Read more »

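Responsive layout usually means choosing a base value and specifying widths, heights, and font sizes as percentages of it. When the base value changes, the page elements and their dimensions scale proportionally.
Responsive width
Avoid absolute widths
By default the page width equals the screen width, so most of the time simply avoiding absolute widths is enough to get a responsive width:
    body {
      width: 100%; /* or width: auto; */
    }
For images, the max-width property can also be used; see responsive web design: image.
    img {
      max-width: 100%;
      height: auto;
    }
Using media queries
This suits cases where different screens need different layouts. The CSS @media rule applies styles based on one or more media queries about device type, specific features, and environment.
    /* Media query */
    @media screen and (min-width: 900px) {
      article {
        padding: 1rem 3rem;
      }
    }
    /* Nested media query */
    @supports (display: flex) {
      @media screen and (min-width: 90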
Read more »

See Search and destroy duplicate rows in PostgreSQL.
Find duplicates
using group
    SELECT firstname, lastname, count(*)
    FROM people
    GROUP BY firstname, lastname
    HAVING count(*) > 1;
using partition
    SELECT *
    FROM (SELECT *, count(*) OVER (PARTITION BY firstname, lastname) AS count
          FROM people) tableWithCount
    WHERE tableWithCount.count > 1;
Using not strict distinct
Use the non-strict DISTINCT ON to find the unique rows; whatever remains is a duplicate and can be modified or deleted.
    DELETE FROM people
    WHERE people.id NOT IN
    (SELECT id FROM (
    SELECT
Read more »
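The duplicate search from the PostgreSQL excerpt above can be tried out without a database server. The sketch below is an assumption-laden stand-in: it uses SQLite through Python's standard `sqlite3` module rather than PostgreSQL, the sample rows are invented, and because SQLite has no `DISTINCT ON`, the delete step keeps `MIN(id)` per name instead of following the post's approach.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (sample data is made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT)"
)
conn.executemany(
    "INSERT INTO people (firstname, lastname) VALUES (?, ?)",
    [("Ada", "Lovelace"), ("Alan", "Turing"), ("Ada", "Lovelace")],
)

# Same GROUP BY / HAVING query as in the excerpt: names that occur more than once.
duplicates = conn.execute(
    "SELECT firstname, lastname, count(*) FROM people "
    "GROUP BY firstname, lastname HAVING count(*) > 1"
).fetchall()

# Keep the lowest id per name and delete the rest.
# MIN(id) replaces DISTINCT ON here, since SQLite does not support it.
conn.execute(
    "DELETE FROM people WHERE id NOT IN "
    "(SELECT MIN(id) FROM people GROUP BY firstname, lastname)"
)
remaining = conn.execute(
    "SELECT firstname, lastname FROM people ORDER BY id"
).fetchall()

print(duplicates)
print(remaining)
```

Running the `SELECT` before the `DELETE` is a cheap way to check which rows are about to be affected.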

java.io.ByteArrayOutputStream
This is typically used when working with byte streams.
The article java performance tuning guide advises against using ByteArrayOutputStream in performance-critical code:
1. Writes are synchronized, which is inefficient
ByteArrayOutputStream allows you to write anything to an internal expandable byte array and use that array as a single piece of output afterwards. Default buffer size is 32 bytes, so if you expect to write something longer, provide an explicit buffer size in the ByteArrayOutputStream(int) constructor
Notes:
1. Internally, ByteArrayOutputStream is a variable-length byte[] (it grows by reallocating). It has an initial capacity (32 by default) that can be set in the constructor.
2. ByteArrayOutputStream writes are synchronized, which hurts efficiency.
2. toByteA
Read more »

three reasons why we should not use inheritance in tests
The gist is:
1. Inheritance is misused in many tests. Tests are code too, so they must follow the principles of inheritance.
The point of inheritance is to take advantage of polymorphic behavior NOT to reuse code, and people miss that, they see inheritance as a cheap way to add behavior to a class. When I design code I like to think about options. When I inherit, I reduce my options. I am now sub-class of that class and cannot be a sub-class of something else. I have permanently fixed my construction to that of the superclass, and I am at a mercy of the super-class changing APIs. My freedom to change is fixed at co
Read more »

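Gradle test configurations
one sample config
ways to improve performance of gradle build
Commonly used properties:
* jvmArgs: JVM arguments. Usually used to configure the heap size so that the tests get the memory they need.
  * '-Xms128m', '-Xmx1024m', '-XX:MaxMetaspaceSize=128m'. -Xms is the initial heap size, -Xmx is the maximum heap size, and -XX:MaxMetaspaceSize is the maximum native memory that class metadata may occupy (unlimited by default). See the Java documentation for the individual JVM options.
* forkEvery: the maximum number of test classes to run in one test process. Once the limit is reached, the process is restarted automatically. This defines when a test process is restarted and has nothing to do with concurrency. The default is 0, meaning no limit, so the same process keeps running.
* maxParallelForks: the maximum number of test processes that can run in parallel.
* systemProperty: system properties.
* environment: system environment variables.
* include: the tests to execute. This can be used to configure different test levels (unit tests, integration tests, functional tests, ...).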
Read more »

Synchronous vs multiprocessing vs multithreading vs async
Concurrency vs Parallelism.
asyncio and threading can run multiple I/O operations at the same time. Async runs one block of code at a time, while threading runs just one line of code at a time. With async, we have better control over when execution is handed to another block of code, but we have to release execution ourselves.
* I/O-bound problems: use async if your libraries support it; if not, use threading.
* CPU-bound problems: use multiprocessing.
* None of the above is a problem: you are probably just fine with synchronous code. You may
Read more »
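As a rough illustration of the I/O-bound advice in the excerpt above, the sketch below runs three simulated I/O waits first with asyncio and then with a thread pool. The function names and the 0.1 second delays are invented for the example, and `sleep` stands in for real network or disk I/O.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

async def fake_io(delay):
    # "await" is where this coroutine explicitly hands control back to the loop.
    await asyncio.sleep(delay)
    return delay

async def run_async():
    # The three waits overlap, so the total is close to one delay, not three.
    return await asyncio.gather(fake_io(0.1), fake_io(0.1), fake_io(0.1))

def blocking_io(delay):
    # A blocking call: fine inside a thread, but it would stall an event loop.
    time.sleep(delay)
    return delay

def run_threads():
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(blocking_io, [0.1, 0.1, 0.1]))

start = time.perf_counter()
async_results = asyncio.run(run_async())
async_elapsed = time.perf_counter() - start

start = time.perf_counter()
thread_results = run_threads()
thread_elapsed = time.perf_counter() - start

print(f"async:   {async_results} in {async_elapsed:.2f}s")
print(f"threads: {thread_results} in {thread_elapsed:.2f}s")
```

Both variants should finish in roughly the time of a single wait. The async version needs libraries that expose awaitable calls, while the threaded version works with ordinary blocking functions.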